Why the Attention Mechanism?

  • The attention mechanism was first introduced by Bahdanau et al. (2014) to address the bottleneck that arises from using a fixed-length encoding vector in sequence-to-sequence models, where the decoder has only limited access to the information provided by the input. This becomes especially problematic for long and/or complex sequences, whose representation is forced to have the same dimensionality as that of shorter or simpler sequences.
  • With this mechanism, the decoder can access the information provided by the entire input sequence, and is therefore better able to learn which parts of the input matter when predicting the next word in the output sequence.

How Does the Attention Mechanism Work?

  • In this attention model we use a Bidirectional LSTM on the encoder side and an LSTM on the decoder side. To brush up on LSTMs and Bidirectional LSTMs, see my previous post.

    (Figure: encoder-decoder model with a Bidirectional LSTM encoder, an attention layer producing the context vector, and an LSTM decoder)

    • At each time step t, the words of the input sequence are represented by the hidden states h_i of the Bidirectional LSTM encoder; these are the vectors over which attention is given.
    • Weights α_1, α_2, α_3, … are then computed using the previous state of the decoder and the hidden states of the encoder.
    • Each hidden state h_i is multiplied by its weight α_i and the products are summed to form the context vector; the decoder then uses this context vector to predict the next word.
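    • For a toy illustration (the numbers are made up purely for this example): if the encoder produced three hidden states h_1, h_2, h_3 and the weights came out as α_1 = 0.7, α_2 = 0.2, α_3 = 0.1, the context vector would be c = 0.7·h_1 + 0.2·h_2 + 0.1·h_3, so the decoder's next prediction would be driven mostly by the information in h_1.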
  • According to the original paper, the attention mechanism is a step-by-step computation of the alignment scores, the weights, and the context vectors. Let me explain them step by step:

    • Alignment Scores:

      • The alignment model takes each encoded hidden state, h_i, together with the previous decoder output, s_{t-1}, and computes a score, e_{t,i}.
      • This score indicates how well the corresponding element of the input sequence aligns with the current output at position t.
      • The alignment model is represented by a function, a(·), which can be implemented by a feedforward neural network:

        e_{t,i} = a(s_{t-1}, h_i)
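      • As a minimal sketch, such a feedforward alignment model can be written in NumPy; the additive form v_a · tanh(W_a s_{t-1} + U_a h_i) follows Bahdanau et al., while the function name, variable names, and shapes below are my own, chosen purely for illustration:

```python
import numpy as np

def alignment_score(s_prev, h_i, W_a, U_a, v_a):
    """Additive alignment model a(s_{t-1}, h_i) as a one-hidden-layer feedforward net.

    s_prev : previous decoder state s_{t-1},        shape (d_dec,)
    h_i    : one encoder hidden state,              shape (d_enc,)
    W_a    : weights applied to the decoder state,  shape (d_att, d_dec)
    U_a    : weights applied to the encoder state,  shape (d_att, d_enc)
    v_a    : projection down to a scalar score,     shape (d_att,)
    """
    return float(v_a @ np.tanh(W_a @ s_prev + U_a @ h_i))
```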

    • Weights:

      • The weights, α_{t,i}, are computed by applying a softmax operation to the previously computed alignment scores:

        α_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j})
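      • A minimal NumPy version of this step (the function name softmax here is mine; subtracting the maximum score is only for numerical stability):

```python
import numpy as np

def softmax(scores):
    """Map alignment scores e_{t,i} to weights alpha_{t,i} that are positive and sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return exp_scores / exp_scores.sum()
```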

    • Context Vectors:

      • A unique context vector, c_t, is fed into the decoder at each time step t. It is computed as the weighted sum of the encoder hidden states, h_i, using the weights α_{t,i}:

        c_t = Σ_i α_{t,i} h_i
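      • Putting the three steps together, a minimal NumPy sketch of one decoding step of Bahdanau-style attention could look like this (the name bahdanau_context and all shapes are mine; in a real model W_a, U_a and v_a are learned during training):

```python
import numpy as np

def bahdanau_context(s_prev, H, W_a, U_a, v_a):
    """Compute the context vector c_t for one decoder time step.

    s_prev : previous decoder state s_{t-1}, shape (d_dec,)
    H      : encoder hidden states h_1..h_n, shape (n, d_enc)
    Returns the context vector c_t, shape (d_enc,), and the attention weights.
    """
    # 1. Alignment scores e_{t,i} = a(s_{t-1}, h_i)
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in H])
    # 2. Weights alpha_{t,i} = softmax(e_{t,i})
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Context vector c_t = sum_i alpha_{t,i} * h_i
    context = weights @ H
    return context, weights
```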

  • The general attention mechanism makes use of three components, namely the queries Q, the keys K, and the values V.

  • If we compare these three components to the attention mechanism proposed by Bahdanau et al., then the query is analogous to the previous decoder output, s_{t-1}, while the values are analogous to the encoded inputs, h_i. In the Bahdanau attention mechanism, the keys and values are the same vectors.

  • Step-by-step computation of the generalized attention mechanism:

    • Each query vector, q = s_{t-1}, is matched against a database of keys to compute a score value. This matching operation is computed as the dot product of the specific query under consideration with each key vector, k_i:

      e_{q,k_i} = q · k_i
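    • For instance, with purely illustrative numbers q = [1, 0, 1], k_1 = [1, 1, 0], k_2 = [0, 1, 1] and k_3 = [1, 0, 1], the scores are e_{q,k_1} = 1, e_{q,k_2} = 1 and e_{q,k_3} = 2.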

    • The scores are passed through a softmax operation to generate the weights:

      α_{q,k_i} = softmax(e_{q,k_i}) = exp(e_{q,k_i}) / Σ_j exp(e_{q,k_j})
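    • Continuing the illustrative numbers from above, softmax([1, 1, 2]) ≈ [0.21, 0.21, 0.58], so the third key receives most of the weight.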

    • The generalized attention output is then computed as a weighted sum of the value vectors, v_{k_i}, where each value vector is paired with a corresponding key:

      attention(q, K, V) = Σ_i α_{q,k_i} · v_{k_i}
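    • Below is a minimal NumPy sketch of this generalized attention for a single query (the function and variable names are mine, and the tiny usage example reuses the illustrative numbers from above):

```python
import numpy as np

def attention(q, K, V):
    """Generalized attention output for a single query vector.

    q : query,  shape (d_k,)
    K : keys,   shape (n, d_k), one key per row
    V : values, shape (n, d_v), value i is paired with key i
    """
    scores = K @ q                           # dot-product score q . k_i for every key
    weights = np.exp(scores - scores.max())  # softmax over the scores...
    weights /= weights.sum()                 # ...gives the attention weights
    return weights @ V                       # weighted sum of the value vectors

# Tiny usage example with the illustrative numbers used above
q = np.array([1.0, 0.0, 1.0])
K = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
V = np.eye(3)                 # toy values chosen so the output equals the weights
print(attention(q, K, V))     # approximately [0.21, 0.21, 0.58]
```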

  • Within the context of machine translation, each word in an input sentence would be attributed its own query, key and value vectors. These vectors are generated by multiplying the encoder’s representation of the specific word under consideration with three different weight matrices that would have been learned during training.

  • When the generalized attention mechanism is presented with a sequence of words, it takes the query vector attributed to some specific word in the sequence and scores it against each key in the database. In doing so, it captures how the word under consideration relates to the others in the sequence. It then scales the values according to the attention weights (computed from the scores) in order to retain focus on those words that are relevant to the query, and so produces an attention output for the word under consideration, as sketched below.
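  • As a sketch of that last point (the toy word representations and the matrices W_q, W_k, W_v below are randomly initialised purely for illustration; in a real model the three weight matrices are learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy encoder representations for a 4-word sentence, embedding dimension 8
words = rng.normal(size=(4, 8))

# Three weight matrices that would have been generated (learned) during training
W_q, W_k, W_v = (rng.normal(size=(8, 6)) for _ in range(3))

# Each word in the sentence is attributed its own query, key and value vector
Q, K, V = words @ W_q, words @ W_k, words @ W_v

# Take the query of one specific word and score it against every key in the sentence
q = Q[1]                                   # query vector for the second word
scores = K @ q                             # how the word relates to the others
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # attention weights over the sentence
output = weights @ V                       # attention output for that word
print(output.shape)                        # (6,)
```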
  • This is the self-attention mechanism, which I will explain in my next blog post, so stay tuned.