GloVe
All about the theoretical understanding and practical implementation of GloVe.
- What is GloVe?
- Why is GloVe better than Word2Vec?
- How does GloVe work?
- Mathematics behind GloVe
- Now comes the cost function
- Code implementation of GloVe
- Downloading the word vectors
- Vector representation of a word
- Understanding the meaning of word vectors
- QUEEN - GIRL + BOY = KING
- Visualizing the word embeddings using t-SNE
- Summary
What is GloVe?
GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
GloVe is an approach that marries the global statistics of matrix factorization techniques like LSA with the local, context-based learning of Word2Vec.
Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that generally produces better word embeddings.
What are local and global statistics?
- Local statistics are statistics computed within a window of words around each target word, which I already explained in my previous blog on Word2Vec.
- Global statistics are statistics computed across the whole text corpus.
Why is GloVe better than Word2Vec?
Since Word2Vec is already a top-performing model, why not just use it? The reason lies not in performance but in the fundamentals of the problem formulation. Remember that Word2Vec relies only on the local statistics of language: the semantics learnt for a given word are affected only by its surrounding words.
How does GloVe work?
Before we begin, we need to know what a co-occurrence matrix is and why the GloVe model can derive semantic relationships between words from it.
A co-occurrence matrix shows the co-occurrence of words, i.e. how many times each word in a text appears together with every other word.
For example, take the two sentences "I love NLP" and "I love to make videos"; their co-occurrence matrix counts every pair of words that appears together within a context window.
Now take pairs of words with respect to "I": the pair "I love" occurs twice across the two sentences, so in the matrix the entry for "love" with respect to "I" is 2, and in the same way all the other pair counts are stored in the matrix (a small sketch that builds this matrix follows below).
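Here is a minimal sketch of how such a co-occurrence matrix can be built in plain Python. It assumes a symmetric context window of size 1; both the window size and the helper name build_cooccurrence are illustrative choices, not something prescribed by GloVe.

from collections import defaultdict

def build_cooccurrence(sentences, window=1):
    # Count how often each word appears within `window` positions of another word
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            # Look at the neighbours of `word` inside the context window
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    counts[word][tokens[j]] += 1
    return counts

X = build_cooccurrence(["I love NLP", "I love to make videos"])
print(X["i"]["love"])   # 2 -> "love" occurs twice in the context of "I"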
Then how does the GloVe model utilize it?
First, let us look at the notation used in the GloVe paper: X_ij is the number of times word j occurs in the context of word i, where j is a context word. For example, X_{I,love} = 2 means that the word "love" occurs 2 times in the context of "I".
Now, to see the power of the co-occurrence matrix, consider the co-occurrence probability
P_ik = X_ik / X_i, where X_i = Σ_k X_ik
- The numerator is the same co-occurrence count explained above, and the denominator X_i is obtained by summing all the column values of row i, i.e. the total number of co-occurrences involving word i.
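As a quick illustration, these probabilities and their ratios can be computed directly from a count matrix. The numbers below are made up for a toy vocabulary and are not real corpus statistics:

import numpy as np

# Hypothetical co-occurrence counts for a toy vocabulary (illustrative numbers only)
words = ["ice", "steam", "solid", "gas", "water"]
X = np.array([
    [0, 2, 8, 1, 6],   # counts for the row word "ice"
    [2, 0, 1, 7, 5],   # counts for the row word "steam"
    [8, 1, 0, 1, 2],
    [1, 7, 1, 0, 3],
    [6, 5, 2, 3, 0],
], dtype=float)

# P_ik = X_ik / X_i, where X_i is the sum of row i
P = X / X.sum(axis=1, keepdims=True)

i, j, k = words.index("ice"), words.index("steam"), words.index("solid")
print(P[i, k], P[j, k], P[i, k] / P[j, k])   # P(solid|ice), P(solid|steam) and their ratio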
Now, from the GloVe paper (target words "ice" and "steam", probe words k):

Probability and ratio       k = solid     k = gas       k = water     k = fashion
P(k | ice)                  1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(k | steam)                2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(k | ice) / P(k | steam)   8.9           8.5 × 10⁻²    1.36          0.96

- In the first two rows of the table we see the probability of each probe word k with respect to "ice" and with respect to "steam", where k is set to "solid", "gas", "water" and a random unrelated word ("fashion").
- Now we divide one probability by the other and keep the result in the last row, which shows something remarkable: the ratio for "solid" is high because "solid" is closely related to "ice" but not to "steam", while the ratio for "gas" is low for the opposite reason; for words related to both ("water") or to neither ("fashion"), the ratio is close to 1.
Mathematics behind GloVe
I am using the following notation, which is slightly different from the paper's due to difficulties rendering LaTeX in a Jupyter notebook.
- w, u — Two separate embedding layers.
- w* — Transpose of w
- X — co-occurrence matrix
- bw and bu — Biases of w and u respectively
There are three issues to be solved to arrive at word embeddings using GloVe.
- We don't yet have a concrete equation relating word vectors to the ratio, only the requirement that some function F(i, j, k) should produce P_ik/P_jk.
- Word vectors are high-dimensional vectors, whereas P_ik/P_jk is a scalar, so there is a dimensional mismatch.
- There are three entities involved (i, j and k), but computing a loss function with three entities can get hairy, so this needs to be reduced to two.
Solving the first issue is easy: we simply write down the expression. Assume there is a function F which takes in the word vectors of i, j and k and outputs the ratio we are interested in:
F(w_i, w_j, u_k) = P_ik / P_jk
- Why are we using two separate embedding layers, w and u? The paper says the two layers usually perform equivalently and differ only because of their random initialization; however, having two layers helps the model reduce overfitting.
Word vector spaces have a linear substructure; for example, you can do arithmetic in the embedding space, e.g.
w_king - w_male + w_female ≈ w_queen
So it is natural to express the ratio in terms of a vector difference, and the above expression becomes:
F(w_i - w_j, u_k) = P_ik / P_jk
Now let us solve the second issue. In
F(w_i - w_j, u_k) = P_ik / P_jk
the right-hand side is a scalar, but the left-hand side operates on vectors, so we need to turn it into a scalar as well. This is done by introducing a transpose and a dot product between the two entities:
F((w_i - w_j)* u_k) = P_ik / P_jk
- If you think of a word vector as a D x 1 matrix, (w_i - w_j)* will be 1 x D, which gives a scalar when multiplied with u_k (a quick sanity check in code is shown below).
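As a tiny sanity check with made-up vectors, the dot product of the difference vector with u_k is indeed a single number:

import numpy as np

rng = np.random.default_rng(0)
D = 4                                     # toy embedding dimension, purely illustrative
w_i, w_j, u_k = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D)

value = (w_i - w_j) @ u_k                 # dot product of two D-dimensional vectors
print(value, np.ndim(value))              # np.ndim(value) is 0, i.e. the result is a scalar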
Now, how do we choose the function F? If we assume F has a certain property, namely that it is a homomorphism between the additive and the multiplicative group, we get
F(w_i* u_k - w_j* u_k) = F(w_i* u_k) / F(w_j* u_k) = P_ik / P_jk
In other words, this particular homomorphism ensures that the subtraction F(A - B) can also be represented as the division F(A)/F(B) and give the same result. Therefore,
F(w_i* u_k) / F(w_j* u_k) = P_ik / P_jk, and matching numerator with numerator and denominator with denominator,
F(w_i* u_k) = P_ik
If we choose F = exp, the above homomorphism property is satisfied, since exp(a - b) = exp(a) / exp(b). Then let us set
exp(w_i* u_k) = P_ik = X_ik / X_i
and, taking the logarithm of both sides,
w_i* u_k = log(X_ik) - log(X_i)
Next, since X_i is independent of k, we move log(X_i) to the left-hand side:
w_i* u_k + log(X_i) = log(X_ik)
Now, given that we do not yet have a bias in the equation, we get a bit creative: we express log(X_i) as a bias bw_i for word i and, to keep the equation symmetric in i and k, add a bias bu_k for word k. In neural network parlance we get
w_i* u_k + bw_i + bu_k - log(X_ik) = 0
where bw and bu are the biases of the network.
Now comes the cost function
In an ideal setting, where we have perfect word vectors, the above expression would be exactly zero. In other words, that is our goal or objective, so we set the left-hand-side expression as our cost function:
J(w_i, u_j) = (w_i* u_j + bw_i + bu_j - log(X_ij))²
The square makes this a squared-error cost function, and the full objective sums this term over all word pairs (i, j). In the above expression k has been replaced with j.
This is not yet the final cost function: if X_ij is zero, then log(0) is undefined, and that has to be handled. One way is to replace log(X_ij) with log(1 + X_ij), an additive (Laplace-style) smoothing, but the GloVe paper proposes a sleeker way of doing this: introduce a weighting function f,
J = f(X_ij) (w_i* u_j + bw_i + bu_j - log(X_ij))²
where f(x) = (x / x_max)^a if x < x_max, else 1. Since f(0) = 0, pairs that never co-occur simply do not contribute to the cost, and very frequent pairs are prevented from dominating it.
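To make this concrete, here is a minimal NumPy sketch (an illustration, not the paper's reference implementation) of the weighting function and the weighted squared-error objective over the non-zero entries of X. The values x_max = 100 and a = 0.75 are the ones suggested in the GloVe paper; the function names and the tiny random example are my own.

import numpy as np

def weighting(x, x_max=100.0, alpha=0.75):
    # GloVe weighting function f(x): (x / x_max)^alpha below x_max, 1 above it
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_cost(W, U, bw, bu, X):
    # Weighted squared error, summed over the pairs (i, j) that actually co-occur
    i_idx, j_idx = np.nonzero(X)
    counts = X[i_idx, j_idx]
    preds = np.sum(W[i_idx] * U[j_idx], axis=1) + bw[i_idx] + bu[j_idx]
    errors = preds - np.log(counts)
    return np.sum(weighting(counts) * errors ** 2)

# Tiny random example: vocabulary of 5 words, 4-dimensional embeddings
rng = np.random.default_rng(42)
V, D = 5, 4
X = rng.integers(0, 10, size=(V, V)).astype(float)
W, U = rng.normal(scale=0.1, size=(V, D)), rng.normal(scale=0.1, size=(V, D))
bw, bu = np.zeros(V), np.zeros(V)
print(glove_cost(W, U, bw, bu, X))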
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import gensim.downloader as api

# Downloading the word vectors: pre-trained 300-dimensional GloVe vectors
# (trained on Wikipedia + Gigaword), fetched via the gensim downloader
glove_model = api.load('glove-wiki-gigaword-300')

# Vector representation of a word: a 300-dimensional NumPy array
glove_model["beautiful"]

# Understanding the meaning of word vectors: nearest neighbours by cosine similarity
glove_model.most_similar("girl")
QUEEN - GIRL + BOY = KING
This is one of the most striking results we get from this model. Here we subtract the embedding of the word "girl" from the embedding of "queen": what remains captures the royalty component of "queen", and when we add the embedding of the word "boy" we land near the embedding of "king", which is exactly the logical result. How amazing.

# queen - girl + boy: the top match is expected to be "king"
glove_model.most_similar(positive=['boy', 'queen'], negative=['girl'], topn=1)
# Visualizing the word embeddings using t-SNE
vocab = ["boy", "girl", "man", "woman", "king", "queen",
         "banana", "apple", "mango", "fruit", "coconut", "orange"]

def tsne_plot(model):
    # Project the GloVe vectors of the words in `vocab` down to 2-D and plot them
    labels = []
    wordvecs = []
    for word in vocab:
        wordvecs.append(model[word])
        labels.append(word)

    # Reduce the 300-dimensional vectors to 2 dimensions with t-SNE
    tsne_model = TSNE(perplexity=3, n_components=2, init='pca', random_state=42)
    coordinates = tsne_model.fit_transform(np.array(wordvecs))

    x = [value[0] for value in coordinates]
    y = [value[1] for value in coordinates]

    plt.figure(figsize=(8, 8))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(2, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(glove_model)