Word Representation in NLP - Part 1

The basic idea behind Natural Language Processing (NLP) is to teach machines to understand human language, in written or spoken form, and to derive meaning from it.

So the inputs and most of the features in NLP tasks are textual, be it words, n-grams, letters or Part-of-Speech (POS) tags. But machines understand numbers, not letters or words: machine learning models take real-valued vectors as input. So we need to translate our language into theirs for them to make sense of it. We need a process that maps textual data to real-valued vectors. This process is what we call word representation, or feature extraction, in NLP problems.

This post provides an overview of the different possible approaches to word representation in NLP. The approaches have been loosely classified into the following categories:

  1. Count-based methods - Sparse encodings
    • Dictionary lookup
    • One-hot encoding
    • Bag of words
  2. Distributional semantics methods
    • Co-occurrence matrix
    • Pointwise Mutual Information (PMI)
    • Term Frequency - Inverse Document Frequency (TF-IDF)
  3. Dense encodings - Dense word embeddings
    • Singular Value Decomposition (SVD)
    • Word2Vec

Note that different resources might provide different categorizations, but I find this the most suitable. Regardless, I have tried to cover all the popular methods in a comprehensive manner here.
This article is the first in a series of two articles:

  1. THIS - Discusses sparse encodings and distributional semantics methods.

  2. Part 2 - Discusses dense encoding techniques in detail.

Basic Concepts

Let us first see some basic terms that will be used in the rest of the article.

  1. Corpus - In simple language, a corpus is a large collection of texts that has been accumulated from various written, spoken or other natural language sources and assembled in a structured manner.

  2. Document - Each instance in the corpus is a document. A document can contain any number of sentences.

  3. Context - The context is the neighbourhood, within a specified window size, that is taken into consideration while finding the representation of a word. For example, for a word, the words on either side of it could be its context. For a sentence, on the other hand, the surrounding sentences might be its context.

  4. Vocabulary - Vocabulary is the set of unique words in the corpus.

Figure 1: Basic terms in NLP

. . .

Count-based methods (Sparse encodings)

Sparse encodings are the simplest approaches to word representation. They are very high dimensional representations which only capture the occurrence and co-occurrence of words in texts, without any kind of semantic information. They are still useful for many NLP tasks such as text classification.

Dictionary Lookup

This is the simplest word representation approach: we form a dictionary of all the words we have observed, and the index of a word in the dictionary is the numeric representation we are after. Let us see how to incorporate it in our word representation.

  • For this, we first take our corpus and perform pre-processing steps to extract words from it.
  • The pre-processing steps that are normally followed in NLP can be read in detail here.
  • Next, we form a dictionary by mapping each unique word with a unique lookup id.

Let us consider an example for better understanding.

Sentence 1: The fool doth think he is wise, but the wise man knows himself to be a fool.

We obtain a set of unique words after completing the pre-processing steps required for our use-case. The dictionary will look something like this:

Word Lookup ID
fool 1
doth 2
think 3

Figure 2: Part of the dictionary formed for our example
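
The dictionary construction can be sketched in a few lines of Python. Note that the ids in Figure 2 (fool → 1, doth → 2, think → 3) suggest that at least "the" was dropped during pre-processing, so the tiny stop-word set below is an assumption for this illustration:

```python
import re

sentence = ("The fool doth think he is wise, but the wise man "
            "knows himself to be a fool.")

# Minimal pre-processing: lowercase and keep only word tokens.
# The stop-word set is an assumption for this illustration.
stop_words = {"the"}
words = [w for w in re.findall(r"[a-z]+", sentence.lower())
         if w not in stop_words]

# Assign lookup ids in order of first appearance, starting at 1.
lookup = {}
for word in words:
    if word not in lookup:
        lookup[word] = len(lookup) + 1

print(lookup["fool"])   # 1
print(lookup["doth"])   # 2
print(lookup["think"])  # 3
```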

Now the numerical representation of any word can simply be obtained by looking it up in the dictionary.
It is a simple method, but it has a major disadvantage.

  • This method assigns an integer value to each word. So when a deep learning model uses these word representations as its input, it will assume that a word with a higher integer value is somehow more important than one with a lower value. But we know this is a wrong assumption: all words need to be treated with equal importance.

One-Hot Encoding

In this method, we are going to represent each word in our vocabulary with a vector. The name one-hot correctly suggests that only one of the dimensions in the vector would be hot or high for a word. The position of this high label is what will make each vector representation unique for a word. Now let us walk through the steps required to actually construct a one-hot encoded vector representation.

  • Firstly, we will perform the pre-processing on our corpus to obtain a vocabulary of unique words from it.
  • Next, we arrange the words in a specific (let’s say ascending) order.
  • Now we form a binary vector of length equal to the size of the vocabulary, such that the entry at the word's index in the ordered vocabulary is 1 and all other entries are 0.

For the sake of clarity, let us go back to the example sentence we considered in the previous method.
After pre-processing and arranging the words in ascending order, we obtain the following words:

  1. doth
  2. fool
  3. he
  4. himself
  5. know
  6. man
  7. think
  8. wise

The size of vocabulary is 8, so let us try to form the one-hot encoded vector of length 8 for some of these words.

Word 1 2 3 4 5 6 7 8
doth 1 0 0 0 0 0 0 0
know 0 0 0 0 1 0 0 0
wise 0 0 0 0 0 0 0 1

Figure 3: One-hot encoded vector representation of words
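
The table above can be reproduced with a short sketch, assuming the 8-word vocabulary obtained earlier:

```python
# Sorted 8-word vocabulary from the example sentence.
vocab = sorted(["doth", "fool", "he", "himself",
                "know", "man", "think", "wise"])

def one_hot(word):
    """Return a binary vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

print(one_hot("doth"))  # [1, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("know"))  # [0, 0, 0, 0, 1, 0, 0, 0]
print(one_hot("wise"))  # [0, 0, 0, 0, 0, 0, 0, 1]
```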

In comparison to the previous dictionary lookup, we can see how the binary representation is giving equal importance to each word. But it suffers from many problems which we will discuss next.

  • What happens when the vocabulary size is substantially large, maybe millions of words? Apart from the obvious storage problem, this representation might also trigger the phenomenon of the curse of dimensionality.
  • The vectors will also be extremely sparse. It is computationally hard to model sparse data in deep learning, and it is hard to generalize from sparse input.
  • Can you think of any other problem with such a representation? What happens if another word gets added to the vocabulary? And another? We will have to go ahead and change the representation of each word and add the dimension for the added word in it. So it’s not a scalable representation.
  • Now let’s try to understand a bigger problem. This representation is a vector, and between two vectors we can talk about similarity, measured by cosine similarity, Euclidean distance or some other metric. Let’s consider the cosine similarity between the vector representations of think, know and fool. All the cosine similarities come out to be 0, as all one-hot vectors are orthogonal. But ideally, wouldn’t we want the vector for think to be a little more similar to know than to fool, given that think and know are closer in meaning?

So this model gives us no contextual and structural information, and we are essentially losing a large amount of data just like that.
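
We can verify the orthogonality claim directly; this sketch reuses one-hot vectors for think, know and fool over the 8-word vocabulary from above:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors for think, know and fool (vocabulary of size 8).
think = [0, 0, 0, 0, 0, 0, 1, 0]
know  = [0, 0, 0, 0, 1, 0, 0, 0]
fool  = [0, 1, 0, 0, 0, 0, 0, 0]

print(cosine(think, know))  # 0.0 despite the similar meanings
print(cosine(think, fool))  # 0.0
```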

Bag of Words

This method for word representation describes the occurrence of words in a vocabulary.
For this, we first form a bag of all unique words that we extract from all the documents in the corpus. A bag of words contains a vocabulary and a parameter representing the presence of each corresponding word in the vocabulary, which we will discuss in detail in a moment.
So we can think of an entry in this bag as something like this:

{'word': __some_word, 'parameter': __corresponding_parameter_value}

The reason it is called a “bag” of words is because it does not care about any kind of structure or contextual information in the words.
Let us take different reviews of a travel destination as our example:

Document 1: Munnar is the most mesmerizing hill station.
Document 2: Munnar National park is a waste of time.
Document 3: Magnificent view of the peaks in munnar.

The bag of words vector will be of size 1×N, where N = size of the vocabulary.
Now at this stage, we can decide what pre-processing steps we want to follow before forming the bag of words. So if there is no pre-processing done, the bag of words would contain everything from punctuations to repeated words like “Munnar” and “munnar”.
So let’s do a very basic pre-processing to begin with, by removing punctuations, converting all the words to lowercase and then build a bag of words from the above documents:

  • munnar
  • is
  • the
  • most
  • mesmerizing
  • hill
  • station
  • national
  • park
  • a
  • waste
  • of
  • time
  • magnificent
  • view
  • peaks
  • in

We can see the vector size is 1×17.
Now a simple way to represent each document is to construct a vector the size of the vocabulary, where the entry at a word's index is 1 if the word is present in that document and 0 otherwise. Essentially, we are marking the presence or absence of every word from the bag in that specific document.

Document One-hot vector
document 1 [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
document 2 [1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
document 3 [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

Figure 4: One-hot encoded bag of words model
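
The table above can be reproduced with a short sketch; the only pre-processing assumed is lowercasing and punctuation removal, with the vocabulary built in order of first appearance:

```python
import re

docs = [
    "Munnar is the most mesmerizing hill station.",
    "Munnar National park is a waste of time.",
    "Magnificent view of the peaks in munnar.",
]

# Lowercase and keep only word tokens.
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Vocabulary in order of first appearance across the documents.
vocab = []
for tokens in tokenized:
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

def binary_vector(tokens):
    """Mark the presence (1) or absence (0) of each vocabulary word."""
    return [1 if w in tokens else 0 for w in vocab]

print(len(vocab))                   # 17
print(binary_vector(tokenized[0]))  # first seven entries are 1
```

Replacing the presence test with `tokens.count(w)` yields the count-vector flavour discussed next.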

There are different flavours of bag of words model, depending on the numerical value that we calculate for each word. These flavours are called scoring methods.
We have formed a binary vector, giving the One-hot encoding, indicating whether a word is present in a sentence or not. But there can be various other methods to convert the bag of words into a word representation. We can have a frequency distribution vector, called Count vector, indicating how many times the word appeared in that document rather than just indicating the presence or absence of the word. Another technique is TF-IDF, which we will discuss in detail at a later stage in this article.

This is a simple and flexible feature extraction technique, allowing easier interpretability. The efficiency may depend on the different pre-processing steps that we perform and the scoring method we choose.
But it suffers from all the problems that we discussed for one-hot encoding.

So all the sparse encoding methods that we discussed give us no contextual, syntactic and structural information, and we are essentially losing a large amount of data just like that. Let us see how the next category deals with these problems.

Distributional semantic methods

We saw how the previous models completely disregard the words’ actual ordering and meaning. Distributional semantics is an umbrella term for any technique that models the meaning of a word in its vector representation. It is also called the Distributional Similarity Model.
Let us see where its intuition is coming from.
Example 1: You are given 3 sentences:

Sentence 1: Everyone loves a glass of great Pinot Noir.
Sentence 2: Pinot Noir can be drunk in a spectrum of temperatures.
Sentence 3: Pinot noir has red-fruit aromas.

By looking at the sentences and the neighborhood that the phrase Pinot Noir appears in, one can easily figure that it is a type of drink.

Example 2: You are given two sentences:

Sentence 1: The team discussed their immediate goals for next month.
Sentence 2: The company missed the annual target this appraisal season.

From these sentences, we can see that since the words target and goal are being used in a similar environment, they might have similar meanings too.

Example 3: You are given four sentences:

Sentence 1: You might eat this guava.
Sentence 2: You might eat this apple.
Sentence 3: The car is moving on the road.
Sentence 4: The truck is moving on the road.

Let’s see what clear observations can be drawn from this example. Firstly, guava and apple, both being fruits, are appearing in similar kinds of contexts. Similarly, car and truck, both being vehicles, are appearing in similar contexts too.
Now, it is highly unlikely that truck will appear in a context like this:

Sentence 5: Truck can be eaten with spoon and fork.

So the context of a word is helping us understand the similarity in meanings of different words.
Now, if the meaning of words can be understood from their context in everyday life, we can model the same thing by observing usage. This is what distributional semantic methods do. They are based on the assumption that the statistical distribution of a word, or any linguistic item, plays a major role in deciding its semantic behaviour. In simple terms, the meaning of a word can be inferred from the words it is surrounded by in texts.

These methods take a sufficiently large corpus and track the usage of each word. In an over-simplified manner, we can then just count the neighbouring words for every instance of this word in our corpus. We will see the exact quantification process in detail in each of the methods discussed below. Once the measures have been obtained, we can build a semantic space which shows the relationships between all the observable words of a language.

“You shall know a word by the company it keeps.”

Firth, J.R., 1957:11

Co-occurrence Matrix

In general, a co-occurrence matrix is a matrix that stores how many times a particular pair of entities participated in an event together.
Let’s assume we have a column dedicated to each actor in the Bollywood industry and a row dedicated to each actress, and we store the number of movies actress i has done opposite actor j. Therefore,

Entities = Actors and actresses,  Event = Movie

What we get after populating all the elements is a co-occurrence matrix, telling us about pairings in Bollywood movies.
Similarly, we use a co-occurrence matrix in NLP, to store how many times a pair of words occurs together in a document. In this case,

Entities = Words,  Event = Occurrence in common documents

Consider the previous example regarding reviews of a travel destination. If we consider the context of the word national, we can see the words munnar, park, is, a, waste, of and time. So in the co-occurrence matrix, for the word national, we will see the entries corresponding to these words in its context to be 1 and the rest will be 0.

Figure 5: Co-occurrence matrix showing representation for some of the words from our example

But in reality, do all the words in a document contribute to the meaning or context of a word? Probably not. We can infer the meaning of a word by looking at a small context occurring around it.
Let us see this with another example:
You are given two sentences:

Sentence 1: You might eat this guava and then may be go check what the gigantic hounds are doing at our doorstep.
Sentence 2: You might eat this apple, the goodness of which has been brought to you directly from the orchards of Himachal.

To model the meaning of the words apple and guava, we might only find the words like eat to be important. So why even bother looking at distant context words like doorstep, hounds and so on which are not going to have any contribution to the meanings?

Hence we also define the concept of a Context Window, which tells how many neighbours of a word we are taking into consideration. So now we are looking at how many times a pair of words appears together in a context window.

To understand it better through our example, let us take window size = 2. This means, we are now considering two words to the left and two to the right for our current word. So the context words for national can be written as:

Word Context
national munnar(1), park(1), is(1)

Figure 6: Term-context matrix example

This type of co-occurrence matrix is called a Term-Context matrix, since we have defined the context. The previous one is called a Term-Document matrix, because there the whole document is the context for the word. We have now obtained a co-occurrence matrix of size N×N, where N = size of the vocabulary, which we can use for word representation.
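
The term-context construction with window size 2 can be sketched as follows, using the review documents from earlier (only lowercasing and punctuation removal are assumed):

```python
import re
from collections import defaultdict

docs = [
    "Munnar is the most mesmerizing hill station.",
    "Munnar National park is a waste of time.",
    "Magnificent view of the peaks in munnar.",
]

window = 2
# cooc[w1][w2] = number of times w2 appears within `window`
# positions of w1 anywhere in the corpus.
cooc = defaultdict(lambda: defaultdict(int))

for doc in docs:
    tokens = re.findall(r"[a-z]+", doc.lower())
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window),
                       min(len(tokens), i + window + 1)):
            if j != i:
                cooc[word][tokens[j]] += 1

print(dict(cooc["national"]))  # {'munnar': 1, 'park': 1, 'is': 1}
```

This matches the context row for national in Figure 6.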

The major advantage of co-occurrence matrix is that it preserves the semantic relationships.
Let us see the drawbacks of this model.

  • Co-occurrence matrix is less sparse than one-hot encoding but it is still sparse, because not all words will appear in a word’s context. Hence many of the entries will be 0.
  • Other than that, there are words like i, is, are, to in the context matrix, which we call stop words and which do not really have any contribution to the meaning or sentiment of words, or the document in general. But they appear very frequently in any corpus, so their count in the matrix will be very high. You can already see how that’s a problem. Say you wanted to get an idea of the contexts that are shared by the words wayanad and munnar. But the stop words occur frequently with all sorts of words and also in the context of these two words. So they would skew the raw frequency values we stored in the co-occurrence matrix.
  • Similarly, you wanted to know if the word united occurring together with states frequently is showing some sort of unique pattern or is it just as meaningless as some states or these states. How would you be able to assess such a pattern with raw frequencies? You might want a better parameter than raw frequencies, which tells you whether two words co-occur by chance, or they provide a unique concept.

We can deal with these problems in one of the following ways:

  1. We might ignore very frequent words in the co-occurrence matrix.
  2. We may use thresholding: if any word in the context has a frequency higher than, say, t, we record it as t rather than the actual count. Beyond that point we are no longer interested in keeping the count of that word.
  3. We might form an exhaustive list of stop words and remove those from our corpus before we form the vector representation. This is one of the pre-processing steps that we have read about in the previous section.
  4. We can also solve this problem using different methods which are used to assign importance to words. We can use PMI and TF-IDF, which we will be discussing in the next sections.

Apart from the above mentioned problems and their solutions, co-occurrence matrix also suffers from all the problems that were presented in the one-hot encoding method - scalability issues, high dimensional representation and sparsity. To solve these, we might want to apply decomposition techniques such as SVD (Singular Value Decomposition) and reduce the dimensionality of the co-occurrence matrix. We will have a look at SVD in the next part of the article.

Pointwise Mutual Information (PMI)

In general, PMI is a measure of the association between two events. It tries to quantify whether two events occur together more often than they would if they were independent. In our case, the event is the occurrence of two words in each other’s context. So the PMI between two words tries to find out whether they co-occur more often than by chance.
Equation 1: PMI(word1, word2) = log( P(word1, word2) / (P(word1) × P(word2)) )
Equation 2: PMI(word1, word2) = log P(word1, word2) − log P(word1) − log P(word2)
P(word1, word2) = Probability of co-occurrence of word1 and word2
P(word1) = Probability of occurrence of word1
P(word2) = Probability of occurrence of word2

Equation 1 is the original form of the mathematical formula for PMI. It can also be written in the form presented in equation 2.

Note: All these logarithms are with a base of 2.

To understand the concept and its necessity better, let us go back to the example we saw in the co-occurrence matrix. Consider the bigram expression united states. We want to understand whether the occurrence of both words in this expression depicts a special pattern. Yes, both united and states have individual meaning and existence, but when they occur together, do they represent a unique meaning? If they do, is the expression many states also a unique occurrence and an important concept? If not, how exactly do we quantitatively differentiate between the two?
Another example - we see in a corpus that the expression new zealand occurs frequently. Why is that? Is it only because new is a frequent word and thus has a higher chance to occur with other less important words, just like many states in the previous example? Or is new zealand actually a unique concept worth its own existence irrespective of the frequency of both the constituent words?
This is what PMI is trying to solve. We already saw that PMI is trying to capture the likelihood of co-occurrence of two words. The formula might give a better intuition as to what PMI captures.

The range of PMI is [-∞, ∞]. Let us now see different possible cases to get a better grip of how PMI varies with various word frequencies.

  1. If two words appear a lot individually, but not much together, their individual probabilities P(word1) and P(word2) will be high. Their joint probability P(word1, word2) however will be low in this case. Hence the PMI is going to be negative.
  2. If both the words occur less frequently, but both occur together at a high frequency, they are probably representing a unique concept. Looking at the formula, P(word1) and P(word2) will be low, but P(word1, word2) will be high, resulting in a high positive PMI score.
  3. If one of the words has a very high frequency, resulting in a higher co-occurrence of both of the words, we will have a high P(word1), a low P(word2) and a high P(word1, word2). So the overall PMI score will be low. This also means that the stop words and their likes will now have a low PMI in the context of a word.
  4. If two words occur independent of each other, we know that their joint probability can be given by the following formula
    Equation 3: P(word1, word2) = P(word1) × P(word2)
    So the ratio inside the log will be 1, resulting in a PMI score of 0.
  5. If two words never occur independently but always occur together, then P(word1, word2) = P(word1) = P(word2), so the ratio inside the log becomes 1/P(word1). The rarer such a pair, the larger the PMI, approaching infinity in the limit.
  6. If two words never occur together, but occur individually at high frequencies, we can see from the equation 2 that the PMI might go to negative infinity.

But it does not really make sense to say that two words are negatively correlated, so we shouldn’t be dealing with negative PMI values at all. Keeping this idea of eliminating any kind of “unrelatedness” from our quantification in mind, Positive PMI (PPMI) was introduced, in which we assign PMI as 0 in case it came to be less than or equal to 0.
Equation 4: PPMI(word1, word2) = max(PMI(word1, word2), 0)
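
A small sketch of these formulas, using base-2 logarithms as the article does; the probabilities passed in are made up purely for illustration:

```python
import math

def pmi(p_xy, p_x, p_y):
    # PMI = log2( P(x, y) / (P(x) * P(y)) )
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    # Positive PMI: clamp negative (and zero-co-occurrence) values to 0.
    return max(0.0, pmi(p_xy, p_x, p_y)) if p_xy > 0 else 0.0

p_x, p_y = 0.2, 0.2

# Co-occurring more often than chance -> positive PMI.
print(pmi(0.1, p_x, p_y))        # ~1.32
# Independent words: P(x, y) = P(x) * P(y) -> PMI = 0.
print(pmi(p_x * p_y, p_x, p_y))  # 0.0
# Co-occurring less often than chance -> negative PMI, clamped by PPMI.
print(ppmi(0.01, p_x, p_y))      # 0.0
```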

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is another statistical method for measuring how important a word is in a document and in a corpus. In simpler words, it is a scoring scheme. It has many applications, like keyword extraction and information retrieval, and we also use it for feature extraction (text data representation).

To understand this method in detail, let us first introduce some new terms.

  1. Term frequency is the ratio of the count of a word in a document to the total number of words in that document. This measures how frequently a word occurs in the document, thus depicting how much importance it holds.
    Equation 5: TF(word, document) = (count of word in document) / (total words in document)
    But stop words will create chaos due to their sheer volume of presence. So we penalize them in order to nullify the high TF scores they get. For that we have devised the IDF.
  2. Inverse Document Frequency is the log of the ratio of the number of documents in the corpus to the number of documents in which the word occurs.
    Equation 6: IDF(word) = log( (documents in corpus) / (documents containing word) )
    Thus, if a word is present in most of the documents, it will have a low IDF score. The final score for a word in a document is then:
    Equation 7: TF-IDF(word, document) = TF(word, document) × IDF(word)
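
These formulas can be sketched on the travel review documents from earlier. The natural logarithm is an assumption here; some formulations use base 10 or base 2, which changes the scale but not the ranking:

```python
import math
import re

docs = [
    "Munnar is the most mesmerizing hill station.",
    "Munnar National park is a waste of time.",
    "Magnificent view of the peaks in munnar.",
]
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

def tf(word, tokens):
    # Equation 5: count of the word / total words in the document.
    return tokens.count(word) / len(tokens)

def idf(word):
    # Equation 6: log( #documents / #documents containing the word ).
    n_containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / n_containing)

def tf_idf(word, tokens):
    # Equation 7: TF * IDF.
    return tf(word, tokens) * idf(word)

# "munnar" appears in every document, so its IDF (and hence its
# TF-IDF) is 0, even though it is an important word here.
print(idf("munnar"))                  # 0.0
print(tf_idf("waste", tokenized[1]))  # > 0: "waste" is distinctive
```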

Now that we know how it basically works, what drawbacks can you see for this method?

  • You can see that the IDF score comes to 0 when a word is present in all documents, which is good in the case of stop words. But that’s not always desirable. In the travel destination review example, munnar is present in all the documents, hence it gets an IDF score of 0. This suggests it is a trivial high-frequency word, while in reality it is an important word contributing to the sentiment of each document.
  • Lastly, this method too suffers from the problem of sparsity.

This article gives a deeper understanding of the problems with TF-IDF.

Note that PMI and TF-IDF are not encoding methods in themselves. Their purpose is to remove the bias whereby something that is not meaningful (like a stop word) skews the matrix weights simply because it appears frequently everywhere. We use PMI and TF-IDF to produce different flavours of other word representation methods like bag of words or the co-occurrence matrix.

. . .

In this article, we learnt why word representation is an important part of NLP, we saw different approaches (Dictionary Lookup, One-hot encoding, Bag of Words, Co-occurrence matrix, PMI and TF-IDF), we saw the significance of the context of a word, and we discussed the shortcomings of each. In the next part, we will get a detailed overview of dense embedding methods.