Word Representation in NLP - Part 2

The basic idea behind Natural Language Processing (NLP) is to teach machines to understand human language, written or spoken, and to derive meaning from it.

In the previous part of this article, we saw the following categories of word representation in NLP:

  1. Count-based methods - Sparse encodings
  2. Distributional semantics methods

Following are the approaches that will be discussed in this part:

  1. Dense encodings - Dense word embeddings
    • Singular Value Decomposition (SVD)
    • Word2Vec

Dense Encodings

Dense embeddings, commonly known as word embeddings, are the modern incarnation of the distributional semantics model.
The techniques we saw previously all represent each feature as its own dimension, producing inefficient sparse vectors. Word embeddings are low-dimensional, dense mappings of words into vectors, such that the similarity between the vectors correlates with the semantic similarity between the words themselves. Similarity here means that the semantic distance between two words is reflected in the distance between their corresponding vectors. For example, we are given the following sentence:

Sentence: The fool doth think he is wise, but the wise man knows himself to be a fool.

In sparse representation, the occurrence of the word think will not give you any relation to the occurrence of know. But the dense embedding of think will be closer to that of know and farther from the representation of fool in a vector space. This relationship has been depicted in figure 1.

Dense word embeddings can be categorized into two groups:

  1. Count based techniques (PPMI, SVD) or matrix factorization techniques
  2. Prediction based techniques (SGNS, FastText, Word2vec) or neural embeddings

Figure 1

Figure 1: Dense embeddings vector space

This representation is computationally better than the previous methods, which generated sparse vectors. It also generalizes better: instead of randomly assigning a vector to a word, we look at which words a word most often appears around, and in what kind of situations, thus representing it in a more generic and meaningful way. Word embeddings can be learned from one corpus and reused across different corpora.
It is definitely possible to gather a dataset and train a word embedding model yourself, but this article is not focused on that aspect. We will now learn about SVD, and then about a popular technique - Word2Vec, Google's word embedding method for which pre-trained models are available.

Singular Value Decomposition (SVD)

We saw statistical methods such as PMI and TF-IDF in the previous part, which can be combined with co-occurrence matrices to replace the raw count frequencies of words. But we are still struggling with the sparseness of the matrix. Its high dimensionality gives rise to the curse of dimensionality, resulting in model over-fitting. Dimensionality reduction techniques help us deal with that problem, and SVD is one of them. Apart from avoiding the curse of dimensionality, dimensionality reduction has many other benefits too, like:

  • noise elimination
  • redundant feature removal
  • reduced data size, resulting in less storage space and faster training

Let us first try to visualize how the dimensionality of a matrix is being reduced by application of dimensionality reduction techniques.
For that, consider a 3D space which has some data points spread across an almost 2D area. Now, isn't it overkill to represent the data points using three dimensions when they can be represented by two dimensions as well, without much loss of information?
Figure 2

Figure 2: Data points spread in an almost elliptical region in the given 3D space

We can reduce the dimensionality of a matrix by simply dropping columns. Let's say we have a collinear column in the matrix (a column is said to be collinear with other columns if it can be expressed as a linear combination of those columns). It provides no extra information beyond its constituent columns, so we can drop it.
At the other end of the spectrum, we can also achieve dimensionality reduction by forming a linear combination of multiple columns and replacing those columns with this newly derived column. This is the basis of SVD.

SVD is a linear transformation technique that gives the best rank-k approximation of the original matrix.
But first, what is rank? Rank is a quantitative measure of the amount of information stored in a matrix, so we can say that a matrix with a higher rank stores more information. More formally - rank is the number of linearly independent row (or column) vectors in a matrix. A set of vectors is said to be linearly independent if none of them can be expressed as a linear combination of the others.
For example, focus on the 3 X 3 matrix given in figure 3. Can you find a linear relationship between any two of the three rows?
Figure 3

Figure 3: Matrix 1

Here the third row r3 can be expressed in terms of the first row r1 through the relation r3 = r1*3 + 1, hence there are only 2 linearly independent rows. Rank of matrix = 2.
Let us take another matrix of size 2 X 3.
Figure 4

Figure 4: Matrix 2

In this second matrix, both rows are linearly independent of each other, hence the rank of the matrix equals the number of rows, i.e. Rank of matrix = 2.
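Since the matrices in figures 3 and 4 are shown as images, here is an analogous pair of matrices (hypothetical values, not the ones from the figures) whose ranks can be checked with NumPy's `matrix_rank`:

```python
import numpy as np

# A 3 x 3 matrix whose third row is a linear combination of the
# first two (r3 = 2*r1 + r2), so only 2 rows are independent.
m1 = np.array([[1, 2, 3],
               [4, 5, 6],
               [6, 9, 12]])
print(np.linalg.matrix_rank(m1))  # 2

# A 2 x 3 matrix with two linearly independent rows:
# rank = number of rows.
m2 = np.array([[1, 0, 2],
               [0, 1, 5]])
print(np.linalg.matrix_rank(m2))  # 2
```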

So we are converting an m X n matrix to a lower dimension according to its rank, in order to store only the most important information and to discover interesting latent semantics in the corpus.
But how does SVD work, and how does it reduce the dimension of the matrix?
It decomposes the original matrix A into 3 matrices U, S and V such that:
Equation 1:

A = U * S * V^T ,

where U and V are orthogonal matrices of dimensions m X m and n X n respectively, and S is a rectangular diagonal matrix of dimension m X n with non-negative diagonal entries.
No matter which kind of matrix we are taking, it can be decomposed into these 3 matrices.

Note: An orthogonal matrix is a matrix whose rows and columns form an orthonormal set of vectors, i.e. all the vectors are of unit length and every pair of distinct vectors has dot product zero.

Figure 5

Figure 5: Decomposition of matrix A into U, S and V

The columns of U are called left singular vectors and the columns of V (the rows of V^T) are called right singular vectors. The diagonal entries of S are called singular values. The singular values are uniquely determined, and they represent the amount of variance captured by the corresponding left and right singular vectors. The number of non-zero singular values gives the rank of the matrix.
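As a quick sanity check, NumPy's `np.linalg.svd` can decompose a small matrix and verify both the reconstruction and the rank property described above. The matrix below is a made-up example, not from the article:

```python
import numpy as np

# Decompose a small matrix A (m = 3, n = 2) into U, S and V^T.
A = np.array([[3., 1.],
              [1., 3.],
              [2., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=True)  # U: 3x3, s: (2,), Vt: 2x2

# Rebuild the rectangular diagonal matrix S (3 x 2) from the
# singular values, then verify A = U * S * V^T.
S = np.zeros(A.shape)
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))  # True

# The number of non-zero singular values equals the rank of A.
print(np.sum(s > 1e-10), np.linalg.matrix_rank(A))
```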

Now you might be wondering how we are achieving dimensionality reduction from this process and why do we go through the painful procedure of decomposing the matrix. The most important feature of the complete process is that given the decomposed matrices U, S and V, we can form the original matrix A, or an approximation of it. Let us see how we can do that.

A can be expressed as:

A = (σ1 * U1 * V1^T) + (σ2 * U2 * V2^T) + (σ3 * U3 * V3^T) + … ,

where Ui and Vi are the ith columns of U and V respectively.

Figure 6

Figure 6: Matrix A as a linear combination of lower-ranked matrices

This means A can be written as a linear combination of lower-ranked matrices. Now, if we arrange the terms on the RHS in decreasing order of the singular values σi, the smaller terms at the end will be the ones contributing least to the information stored in matrix A. Even if we ignore these and select only the terms with the k largest singular values, the combination will still give us a good approximation of A. This is why SVD is said to give the best rank-k approximation. Equation 1 then becomes:
Equation 2:

A ≈ Uk * Sk * Vk^T

Here k is a hyperparameter that we can tune according to our task requirements. This reduced version of SVD is called Truncated SVD. Uk and Vk consist of the first k columns of U and V respectively, dropping the rest, and Sk of the k largest singular values.
The most important thing to keep in mind is that it is not SVD itself, but its truncation, that reduces the dimension of the original matrix.
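Truncated SVD is easy to experiment with in NumPy. The sketch below, on an arbitrary random matrix, shows that keeping the k largest singular values gives progressively better approximations of A, with k = rank(A) recovering it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def truncate(k):
    # Keep only the k largest singular values and their vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

# The approximation error shrinks as k grows.
for k in (1, 3, 6):
    err = np.linalg.norm(A - truncate(k))
    print(k, round(err, 4))
```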

To visualize this statement, let’s assume our example matrix A to have dimension 10000 X 20000.
This means we will need a space equal to 10000 X 20000 entries to be able to store this matrix.
Now let us analyze its Singular Value Decomposition.
U and V will have dimensions 10000 X 10000 and 20000 X 20000 respectively.
Say we have set k to 20. Then the truncated sum (the expansion shown in figure 6, cut off after the k largest terms) is going to have 20 terms, each consisting of a vector of length 10000, a vector of length 20000 and one singular value.
So the total space required to store these vectors will be about 20 * (10000 + 20000), which is much less than the space required by the original matrix, resulting in the dimensionality reduction.
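A quick back-of-the-envelope check of this claim (also counting the 20 singular values themselves, which add a negligible 20 extra entries):

```python
# Space for the full 10000 x 20000 matrix vs. the k = 20 truncated
# form: k column vectors, k row vectors and k singular values.
m, n, k = 10_000, 20_000, 20
full = m * n
truncated = k * (m + n + 1)
print(full, truncated)  # 200000000 600020
```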

SVD has its applications in many fields of Machine learning, like latent semantic analysis, latent semantic indexing, image compression, data denoising, feature generation etc.

The power of SVD lies in the fact that the decomposition always exists and is numerically stable. But it is computationally expensive.


Word2Vec

Word2Vec is a method to efficiently create word embeddings. These word embeddings, as we have already seen, help place words with similar meanings close together in the vector space. The basic idea is to capture the usage of a word across the different contexts it appears in, and then use that to model the word's representation. This is in contrast to sparse methods like one-hot encoding, where words do not get similar representations despite their similar usage. Let's look at the clichéd example which is put forth to show how word analogies can easily be represented and solved using vector arithmetic on word embeddings:

King - Man + Woman ~ Queen

This shows that word analogies can be solved using vector arithmetic. So if we take the vector of king, remove the vector representing manliness from it, and add the vector for womanliness, we will get the vector for queen.
Curious phenomenon, isn’t it? It is also fascinating because we never trained the word embedding model explicitly for this kind of behavior. So it’s interesting and important to understand the intuition for Word2Vec, and understand how the word embeddings are able to model the relatedness and similarity among words.
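To make the arithmetic concrete, here is a toy sketch with hand-made 2-dimensional vectors (not trained embeddings - the two axes are constructed to loosely encode "royalty" and "gender" purely for illustration):

```python
import numpy as np

# Hand-made toy vectors, NOT trained embeddings. The two axes
# loosely encode "royalty" and "gender".
vocab = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

target = vocab["king"] - vocab["man"] + vocab["woman"]

# Find the vocabulary word closest to the resulting vector.
closest = min(vocab, key=lambda w: np.linalg.norm(vocab[w] - target))
print(closest)  # queen
```

With real pre-trained embeddings the same arithmetic works in hundreds of dimensions, and the nearest neighbour of king - man + woman is (approximately) queen.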

Word2Vec is a group of shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. There are two techniques or algorithms used by Word2Vec to generate the word vectors, both involving neural networks:

  1. Continuous Bag of Words (CBOW) - This technique takes the context words of a word as input and tries to predict the word that will occur in this neighbourhood according to the given corpus.
  2. Skip-gram - This technique is, in a way, the reverse of the previous one. It takes a word as input and tries to predict, for each word in the vocabulary, the probability of it occurring in the context of the input word.

Figure 7

Figure 7: A basic overview of the architecture diagrams for Word2Vec models

There are already innumerable resources to learn the implementation of Word2Vec. In this article, we will instead focus on why word embeddings work the way they do and why they are modeled that way, so a basic understanding of the Word2Vec model is assumed as a prerequisite. If you want to learn the implementation of Word2Vec in Python, you can refer to this awesome article. If you want to dive really deep into the core mathematics behind the algorithms for CBOW and Skip-gram, you can refer to this.

Note: The below section is heavily inspired by this lecture: Professor Mitesh M. Khapra, Deep Learning(CS7015): Lec 10.4

We saw that both techniques of Word2Vec are predictive models. They are taking some word(s) as input and predicting the presence of some word(s). So let’s begin the section by discussing why we are trying to predict the words when clearly our requirement was a numerical vector representation for each word. The best way to understand why the problem statement of Word2Vec has been modeled the way it is, would be to formulate a similar model ourselves and then try to understand it part by part. We will do exactly that.

For that, let’s consider a multiclass classification problem where we are given some words and we are trying to predict the next word in the series. Why do you think we are considering it a classification problem? Because we are trying to predict one of the words from the vocabulary to be the next word. But wasn’t our goal to learn word representations? So how are we actually being able to relate the task that we just defined, to our original task?
To understand the connection between both tasks, it is crucial to first model the task completely and build the intuition through a dry run example. So for now, let us take the CBOW algorithm into consideration and look at the details; we will dig into the connection as we proceed with the modeling.
As an example, consider a small dummy corpus.

Jack fell down and broke his crown.

The model architecture will be a simple feedforward neural network. Since we have considered the CBOW algorithm, we will be given a series of n-1 context words as input and we are going to predict the next word. So if we assume n to be 5, for our dummy corpus, this is how the (context, target) pairs can be formed by sliding the context window over the words:
Figure 8

Figure 8: Context and target words in example dummy corpus

( [ context ], target )
( [ jack, fell, down, and ], broke )
( [ fell, down, and, broke ], his )
( [ down, and, broke, his ], crown )

To remove all unnecessary complexity from this intuitive model, let us further take n = 2. So now we have one input word wn-1 and we are predicting the next word wn. The input will be a one-hot vector representation of the given word. The output will be a probability distribution over all the words in the vocabulary, telling us which words have a high probability of occurring next to the input word.
So the input will be the size of vocabulary |V|, because it is a one-hot representation of a single word from that vocabulary. The output will be of size |V| too, because we are predicting the probability of all words from the same vocabulary to occur as the next word for the input context word.
Figure 9

Figure 9: Simple CBOW architecture

Notations used in the architecture diagram in figure 9 have been explained below:

  • wn-1: Input word vector
  • h: Hidden layer
  • wn: Output layer, in form of probability distribution vector
  • Wcontext: Weight matrix for the weights between the input layer and the hidden layer
  • Wword: Weight matrix for the weights between the hidden layer and the output layer
  • P(wi|wn-1): Probability of occurrence of the word wi as the target word, given the input context word wn-1

For weight matrix Wcontext, its input is of size |V| and output of size N, so dimension of Wcontext is |V| x N. Similarly dimension of Wword will be N X |V|.

The hidden layer is computed as h = wn-1 * Wcontext. But we know that wn-1 is a one-hot vector, so only one entry will be 1 and the rest will be 0. If the cth entry is 1, representing the word wn-1, the product is simply the cth row of the weight matrix Wcontext:

h = Wcontext,c

We can easily see that there is a direct relationship between every input word and one of the rows of the matrix Wcontext: for every word, a unique index of the one-hot vector will be 1, and the matrix multiplication will then output the corresponding row of Wcontext.
Figure 10

Figure 10: Example to show the matrix product for weight matrix and input layer

In other words, each row of Wcontext is a unique vector of length N for a unique context word in the vocabulary. So we can treat these rows as the much-awaited numerical word representations of the context words. Since these representations are the weights of the network, we don't know their exact values yet - they need to be learnt. But we now have a little more clarity on the input side of the network we modeled: the problem has been set up in such a way that the word representations can be obtained from the weights of the trained network.
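This selection behaviour is easy to verify numerically. The sketch below uses arbitrary sizes (|V| = 5, N = 3): multiplying a one-hot vector with Wcontext simply picks out one row of the matrix.

```python
import numpy as np

V, N = 5, 3                        # vocabulary size and embedding size
rng = np.random.default_rng(1)
W_context = rng.standard_normal((V, N))   # |V| x N weight matrix

c = 2                              # index of the input word
x = np.zeros(V)
x[c] = 1.0                         # one-hot input vector for w_{n-1}

# Multiplying the one-hot vector with W_context selects row c.
h = x @ W_context
print(np.allclose(h, W_context[c]))  # True
```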
Let us then move forward to the next layer of our neural network architecture. This is a multiclass classification problem, so the softmax function can be used as the output function, giving the probability distribution over all words in the vocabulary.
Equation 3:

ŷi = exp(zi) / Σj exp(zj)

where,

zi = (h * Wword)[i] = h * Wword,i

Note that we can write zi in this form because taking the ith entry of the product h * Wword is the same as multiplying h with the ith column Wword,i of Wword.
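A numerically stable softmax takes only a few lines (a generic sketch, not tied to any particular library):

```python
import numpy as np

def softmax(z):
    # Subtracting max(z) keeps exp() from overflowing; the result is
    # unchanged because the constant factor cancels in the ratio.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)
print(y_hat.round(3), y_hat.sum())  # probabilities summing to 1
```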

So after substituting the value of zi and h, the output ŷi can be written in the form of the following equation.

Equation 4:

ŷi = exp(Wcontext,c * Wword,i) / Σj exp(Wcontext,c * Wword,j)

Now let us focus on the zi term a little. What exactly is it telling us? We have already seen that the hidden layer h is the cth row of Wcontext. So zi = h * Wword,i is essentially the dot product of the cth row of Wcontext with the ith column of Wword.
So, indirectly, P(wi | wn-1) depends on the ith column of Wword - each column of the weight matrix Wword also corresponds to a particular word in the vocabulary.

So now we have, for each word in the vocabulary, 2 different learned representations waiting for it. But the requirement was of just one representation for each word in our vocabulary. Which out of these two matrices can be used as the correct representation? Did we make some wrong assumption till here? No. This is actually because Wcontext represents the word when it appears in the context and Wword represents the word when it appears as the target word.

Having established the understanding of the weight terms, let’s move to the actual training part of the model.
The training algorithm used is backpropagation. Loss is computed using cross-entropy. The categorical cross-entropy can be written like:
Equation 5:

L = − Σi yi * log(ŷi)

Here y is the one-hot vector of the true target word, so if w is the index of that word, only yw = 1 and the loss reduces to L = − log(ŷw). Or, substituting ŷw from equation 4 and using the properties of logarithm, we can write it as:

L = − (Wcontext,c * Wword,w) + log( Σj exp(Wcontext,c * Wword,j) )

Note: If you have problems with the loss function, refer to this.

Next we compute the gradient for the output layer:
Equation 6:

∂L/∂Wword,w = − (1 − exp(Wcontext,c * Wword,w) / Σj exp(Wcontext,c * Wword,j)) * h

Note: If you want a step-by-step understanding of how to calculate the derivative of the softmax function and reach the above formula, this is a good read.

We can replace the big term on the RHS of the above equation with ŷw from equation 4:

Equation 7:

∂L/∂Wword,w = − (1 − ŷw) * h

Now that we have computed the gradient, we can write the update rule for the target word's column of Wword, with η as the learning rate:
Equation 8:

Wword,w = Wword,w − η * ∂L/∂Wword,w

Substituting the gradient obtained in equation 7, and remembering that h = Wcontext,c, we can write it in the following form:
Equation 9:

Wword,w = Wword,w + η * (1 − ŷw) * Wcontext,c

If we analyze equation 9 carefully, there is a very interesting observation to make, and it is the reason we took the pain of going through all the mathematical derivations for gradient descent in our modeling task. The predicted value ŷw is a probability, so it will be between 0 and 1. On the basis of this, let's focus on the extreme cases:

  • What happens if ŷw = 1?
    The update term vanishes, so there is no change in Wword,w - the model has already learnt this (context, target) pair well.
  • What happens if ŷw = 0?
    Equation 9 becomes Wword,w = Wword,w + η * Wcontext,c. We see that Wword,w gets updated by adding a small fraction of Wcontext,c to it. So, over iterations, Wword,w will shift closer to Wcontext,c. A more technical way of saying it - the cosine similarity between a target word's representation in the Wword matrix and its context word's representation in the Wcontext matrix will increase over iterations.
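To see this pull in action, here is a tiny numerical sketch of the update in equation 9 with random toy vectors. For illustration, ŷw is held at a fixed value; in real training it changes every step:

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
w_word = rng.standard_normal(4)      # target word's column of W_word
w_context = rng.standard_normal(4)   # context word's row of W_context
eta, y_hat = 0.1, 0.2                # learning rate; ŷw kept fixed here

before = cos(w_word, w_context)
for _ in range(50):
    # Equation 9: W_word,w <- W_word,w + η (1 - ŷw) W_context,c
    w_word = w_word + eta * (1 - y_hat) * w_context
after = cos(w_word, w_context)
print(round(before, 3), round(after, 3))  # similarity increases
```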

Let us see the significance of this observation with an example:

Sentence 1: The car is moving on the road.
Sentence 2: The truck is moving on the road.

The words car and truck share common context words. So the CBOW algorithm will try to shift the representation of car in the Wword matrix towards the representations of the context words moving, road etc. in the Wcontext matrix. Similarly, it will shift the representation of truck towards the representations of moving, road etc. too. So, transitively, these two representations will also come close to each other in the Wword vector space. This is how Word2Vec makes words with similar meanings have similar representations, while words with unrelated usage drift apart in the vector space, hence capturing a sense of the semantics attached to the words on the basis of their usage in the corpus.
This completes our goal of connecting our modeled task with the understanding of the original Word2Vec problem statement.
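To tie everything together, below is a toy end-to-end training run of the simplified n = 2 model on the two example sentences. All sizes, the learning rate and the epoch count are arbitrary choices; this is a sketch of the idea, not a production Word2Vec implementation. With training, the cosine similarity between the Wword columns for car and truck should rise above its random initial value:

```python
import numpy as np

sentences = ["the car is moving on the road",
             "the truck is moving on the road"]
tokens = [s.split() for s in sentences]
vocab = sorted({w for t in tokens for w in t})
idx = {w: i for i, w in enumerate(vocab)}
pairs = [(idx[t[i]], idx[t[i + 1]])            # (context, target) bigrams
         for t in tokens for i in range(len(t) - 1)]

V, N, eta = len(vocab), 8, 0.1
rng = np.random.default_rng(0)
W_context = rng.standard_normal((V, N)) * 0.1  # |V| x N
W_word = rng.standard_normal((N, V)) * 0.1     # N x |V|

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

before = cos(W_word[:, idx["car"]], W_word[:, idx["truck"]])

for _ in range(500):
    for c, t in pairs:
        h = W_context[c]                        # cth row = hidden layer
        z = h @ W_word                          # scores over vocabulary
        y_hat = np.exp(z - z.max())
        y_hat /= y_hat.sum()                    # softmax
        grad = y_hat.copy()
        grad[t] -= 1.0                          # softmax + cross-entropy
        grad_h = W_word @ grad                  # backprop into hidden layer
        W_word -= eta * np.outer(h, grad)       # update all target columns
        W_context[c] -= eta * grad_h            # update the context row

after = cos(W_word[:, idx["car"]], W_word[:, idx["truck"]])
print(round(before, 3), round(after, 3))
```

Because car and truck play interchangeable roles in these two sentences, their target representations end up satisfying the same constraints and move towards each other, exactly as argued above.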

We were able to successfully visualize an intuition as to how weights of the trained model represent the words meaningfully and how similar words get similar representation with Word2Vec. Keep in mind that this is not a proven derivation and just an intuition developed with an oversimplified CBOW model architecture. There is a lot more to Word2Vec, and you can find more details in this paper.
Now let us discuss some of the challenges of Word2Vec:

  • It does not warmly welcome unseen or out-of-vocabulary words. This is especially problematic in use-cases where a lot of noisy data is bound to be seen, like tweets.
  • It does not handle sub-word level information well. For example, if you encounter words like smarter, sharper and stronger, you will see that these end with er to represent comparative adjectives. So when you encounter another word like smaller, you will be able to deduce that this also might be a comparative adjective. But Word2Vec is not able to make use of such information from sub-words.
  • It makes use of only the local information and is not making use of global statistics at all.

. . .

In this article, we learnt about Singular Value Decomposition, a dimensionality reduction technique. Then we discussed dense word embeddings in detail, saw the intuition behind the Word2Vec model architecture, and saw how it captures the meanings of words on the basis of their usage, thus enabling us to treat words as vectors on which algebraic operations can be performed. We also glanced at a few challenges of Word2Vec.