All about Word2Vec

We have previously learned about two methods that extract numerical features from textual data: Bag of Words (BOW) and TF-IDF. Their main drawbacks are that they do not capture semantic relationships between words and that they generate a large number of features, which can be cumbersome to work with.

Word2Vec solves this problem efficiently by creating word embeddings. Word embeddings are an integral part of solving many problems in NLP: they convey how humans understand language to a machine. You can think of them as a vectorized representation of text. Word2Vec, a common method of generating word embeddings, has a variety of applications such as text similarity, recommendation systems, and sentiment analysis.

Before we get into Word2Vec, let's establish an understanding of what word embeddings are. This is important because the overall output of Word2Vec is an embedding associated with each unique word passed through the algorithm.

Word embedding is a technique in which individual words are transformed into a numerical representation (a vector). Each word is mapped to one vector, and that vector is learned, typically with a neural network. The vector tries to capture various characteristics of the word with regard to the overall text, such as its semantic relationships, definitions, and context. With these numerical representations you can do many things, like measure the similarity or dissimilarity between words, as the short sketch below shows.
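For instance, once two words have vectors, cosine similarity is a common way to compare them. Here is a minimal sketch using NumPy and made-up 3-dimensional vectors (real Word2Vec vectors have hundreds of dimensions; the values below are purely illustrative):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values closer to 1 mean more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-d vectors purely for illustration (not real Word2Vec embeddings)
v_king  = np.array([0.8, 0.6, 0.1])
v_queen = np.array([0.7, 0.7, 0.2])
v_apple = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(v_king, v_queen))  # relatively high (~0.98)
print(cosine_similarity(v_king, v_apple))  # relatively low (~0.31)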

Clearly, embeddings are integral as inputs to many machine learning tasks. A machine cannot process text in its raw form, so converting the text into an embedding allows it to be fed to classic machine learning models. The simplest embedding is a one-hot encoding of the text, where each word is mapped to a vector with a single 1 in the position corresponding to that word; a minimal sketch follows.
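As a point of reference before we look at Word2Vec, here is a minimal one-hot encoding sketch over a tiny made-up vocabulary (the words chosen here are an assumption for illustration only):

import numpy as np

# Hypothetical toy vocabulary purely for illustration
toy_vocab = ["king", "queen", "apple"]
word_to_index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index
    vec = np.zeros(len(toy_vocab))
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("queen"))  # [0. 1. 0.]

Note that these vectors grow with the vocabulary and carry no notion of similarity between words, which is exactly what Word2Vec improves on.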

How Does Word2Vec Work?

Word2Vec was a breakthrough in the world of NLP when it was introduced in 2013. Tomas Mikolov, a Czech computer scientist and currently a researcher at CIIRC (Czech Institute of Informatics, Robotics and Cybernetics), was one of the leading contributors to its research and implementation. It comes in two variants; a minimal training sketch for both follows the list below:

  • Continuous Bag of Words (CBOW) Model

    • In this method the context words are given to the model, and it generates a vector that represents the target word, as shown in the figure below. A window length (how far we look for context words) is set, which determines how many surrounding words are taken into consideration. The words selected by the window are sent to an embedding layer of dimension vocab size × embedding dimension (typically 300), which generates an embedding vector for each context word to feed into the model.

    • Next, all the embeddings generated from the context words are fed to an averaging layer, which produces a single vector that is the average of the context-word embeddings.

    • This averaged embedding is passed to a softmax layer to produce a predicted word. By comparing the prediction with the actual target word, we can calculate the loss and backpropagate it, gradually learning weights that predict the target word nearly perfectly.

      [Figure: CBOW architecture]

  • Skip-gram Model

    • In this model the target word is paired with context words (some genuine, some randomly sampled), and learning which pairs truly occur together yields the word embeddings.
    • The embeddings of the target word and the context word are fed to a merge layer, where a dot product between the two produces a single score.
    • This score is passed to a sigmoid layer to produce a label of 0 or 1: 1 for a pair that genuinely co-occurs and 0 for a pair that does not.
    • We then calculate the loss and backpropagate to learn weights that classify the pairs nearly perfectly.

[Figure: Skip-gram architecture]
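The figures above show the two network architectures; in practice you rarely build them by hand. Below is a minimal sketch of training both variants with gensim on a tiny made-up corpus (the sentences and hyperparameters are assumptions purely for illustration, not from the pretrained model used later):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (purely illustrative)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "boy", "plays", "with", "the", "girl"],
]

# sg=0 selects CBOW, sg=1 selects skip-gram; window is the context window length
cbow_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(cbow_model.wv["king"].shape)                    # (100,)
print(skipgram_model.wv.most_similar("king", topn=2))

With such a tiny corpus the resulting vectors are not meaningful; for the rest of this article we instead load vectors pretrained on the Google News corpus.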

Code Implementation of Word2Vec

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import gensim.downloader as api

# Download and load the pretrained 300-dimensional Google News Word2Vec model
word2vec_model = api.load('word2vec-google-news-300')

# Look up the embedding vector for the word "beautiful"
word2vec_model["beautiful"]
array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.2421875 ,
        0.36132812,  0.04760742,  0.00631714, -0.03088379, -0.13964844,
        0.22558594, -0.06298828, -0.02636719,  0.1171875 ,  0.33398438,
       -0.07666016, -0.06689453,  0.04150391, -0.15136719, -0.22460938,
        0.03320312, -0.15332031,  0.07128906,  0.16992188,  0.11572266,
       -0.13085938,  0.12451172, -0.20410156,  0.04736328, -0.296875  ,
       -0.17480469,  0.00872803, -0.04638672,  0.10791016, -0.203125  ,
       -0.27539062,  0.2734375 ,  0.02563477, -0.11035156,  0.0625    ,
        0.1953125 ,  0.16015625, -0.13769531, -0.09863281, -0.1953125 ,
       -0.22851562,  0.25390625,  0.00915527, -0.03857422,  0.3984375 ,
       -0.1796875 ,  0.03833008, -0.24804688,  0.03515625,  0.03881836,
        0.03442383, -0.04101562,  0.20214844, -0.03015137, -0.09619141,
        0.11669922, -0.06738281,  0.0625    ,  0.10742188,  0.25585938,
       -0.21777344,  0.05639648, -0.0065918 ,  0.16113281,  0.11865234,
       -0.03088379, -0.11572266,  0.02685547,  0.03100586,  0.09863281,
        0.05883789,  0.00634766,  0.11914062,  0.07324219, -0.01586914,
        0.18457031,  0.05322266,  0.19824219, -0.22363281, -0.25195312,
        0.15039062,  0.22753906,  0.05737305,  0.16992188, -0.22558594,
        0.06494141,  0.11914062, -0.06640625, -0.10449219, -0.07226562,
       -0.16992188,  0.0625    ,  0.14648438,  0.27148438, -0.02172852,
       -0.12695312,  0.18457031, -0.27539062, -0.36523438, -0.03491211,
       -0.18554688,  0.23828125, -0.13867188,  0.00296021,  0.04272461,
        0.13867188,  0.12207031,  0.05957031, -0.22167969, -0.18945312,
       -0.23242188, -0.28710938, -0.00866699, -0.16113281, -0.24316406,
        0.05712891, -0.06982422,  0.00053406, -0.10302734, -0.13378906,
       -0.16113281,  0.11621094,  0.31640625, -0.02697754, -0.01574707,
        0.11425781, -0.04174805,  0.05908203,  0.02661133, -0.08642578,
        0.140625  ,  0.09228516, -0.25195312, -0.31445312, -0.05688477,
        0.01031494,  0.0234375 , -0.02331543, -0.08056641,  0.01269531,
       -0.34179688,  0.17285156, -0.16015625,  0.07763672, -0.03088379,
        0.11962891,  0.11767578,  0.20117188, -0.01940918,  0.02172852,
        0.23046875,  0.28125   , -0.17675781,  0.02978516,  0.08740234,
       -0.06176758,  0.00939941, -0.09277344, -0.203125  ,  0.13085938,
       -0.13671875, -0.00500488, -0.04296875,  0.12988281,  0.3515625 ,
        0.0402832 , -0.12988281, -0.03173828,  0.28515625,  0.18261719,
        0.13867188, -0.16503906, -0.26171875, -0.04345703,  0.0100708 ,
        0.08740234,  0.00421143, -0.1328125 , -0.17578125, -0.04321289,
       -0.015625  ,  0.16894531,  0.25      ,  0.37109375,  0.19921875,
       -0.36132812, -0.10302734, -0.20800781, -0.20117188, -0.01519775,
       -0.12207031, -0.12011719, -0.07421875, -0.04345703,  0.14160156,
        0.15527344, -0.03027344, -0.09326172, -0.04589844,  0.16796875,
       -0.03027344,  0.09179688, -0.10058594,  0.20703125,  0.11376953,
       -0.12402344,  0.04003906,  0.06933594, -0.34570312,  0.03881836,
        0.16210938,  0.05761719, -0.12792969, -0.05810547,  0.03857422,
       -0.11328125, -0.1953125 , -0.28125   , -0.13183594,  0.15722656,
       -0.09765625,  0.09619141, -0.09960938, -0.00285339, -0.03637695,
        0.15429688,  0.06152344, -0.34570312,  0.11083984,  0.03344727],
      dtype=float32)
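
Each word in this pretrained model is represented by a 300-dimensional vector, which you can confirm from its shape:

word2vec_model["beautiful"].shape  # (300,)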

The Model Understands the Meaning of Words

word2vec_model.most_similar("girl")
[('boy', 0.8543272018432617),
 ('teenage_girl', 0.7927976250648499),
 ('woman', 0.7494640946388245),
 ('teenager', 0.717249870300293),
 ('schoolgirl', 0.7075953483581543),
 ('teenaged_girl', 0.6650916337966919),
 ('daughter', 0.6489864587783813),
 ('mother', 0.6478164196014404),
 ('toddler', 0.6473966836929321),
 ('girls', 0.6154742240905762)]

QUEEN - GIRL + BOY = KING

This is one of the most striking results from the model. Here we subtract the embedding of the word girl from the embedding of queen; what remains captures the royalty aspect of queen, and when it is added to the embedding of the word boy, the nearest word is king, which is exactly what logic suggests. How amazing.

word2vec_model.most_similar(positive=['boy', 'queen'], negative=['girl'], topn=1)
[('king', 0.7298422455787659)]
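
You can probe other relationships in the same way, for example the classic country-capital analogy (the expected answer is shown as a comment rather than pasted output):

word2vec_model.most_similar(positive=['Paris', 'Germany'], negative=['France'], topn=1)
# with this pretrained model the top result is typically 'Berlin'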

Visualizing the Word Embeddings Using t-SNE

We can see from the plot that there is a sense of gender within the human-related vocabulary, and that the human and non-human (fruit) words are clearly separated.

vocab = ["boy", "girl", "man", "woman", "king", "queen", "banana", "apple", "mango", "fruit", "coconut", "orange"]

def tsne_plot(model):
    labels = []
    wordvecs = []

    # Collect the embedding vector and label for each word in our small vocabulary
    for word in vocab:
        wordvecs.append(model[word])
        labels.append(word)

    # Reduce the 300-dimensional vectors to 2 dimensions for plotting
    tsne_model = TSNE(perplexity=3, n_components=2, init='pca', random_state=42)
    coordinates = tsne_model.fit_transform(np.array(wordvecs))

    x = []
    y = []
    for value in coordinates:
        x.append(value[0])
        y.append(value[1])

    # Scatter-plot each word and annotate it with its label
    plt.figure(figsize=(8,8))
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(2, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

tsne_plot(word2vec_model)