Why LSTM?

As we discussed in our previous blog post on RNNs, a plain RNN suffers from the vanishing and exploding gradient problem; we can address this by using an LSTM RNN. One of the biggest defects of the plain RNN is that it cannot keep long-term memory intact.

It works well for short sequences, but on long sequences it struggles with long-term dependencies. This problem is solved by introducing a long-term memory cell into the RNN, which gives us the LSTM RNN.

Working of LSTM RNN

To begin learning about the LSTM RNN, we first need to understand the internal architecture and working of a plain RNN, shown below:

[Figure: internal architecture of an RNN]

The short-term memory cell is shown below:

[Figure: short-term memory cell]

In the short-term memory cell we perform a weighted combination of the input and the hidden state of the previous step and feed the result through a tanh function, which gives the hidden state vector for the next iteration: h_t = tanh(W_x · x_t + W_h · h_{t−1} + b).
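
For concreteness, here is a minimal NumPy sketch of this vanilla RNN update; the weight names (`W_x`, `W_h`, `b`) and the dimensions are illustrative assumptions, not values from the post.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a vanilla RNN: combine the input and the previous hidden
    state, then squash with tanh to get the next hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Illustrative sizes: 4-dimensional input, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_x = rng.normal(size=(3, 4))
W_h = rng.normal(size=(3, 3))
b = np.zeros(3)

h = np.zeros(3)                      # initial hidden state
for x in rng.normal(size=(5, 4)):    # a sequence of 5 inputs
    h = rnn_step(x, h, W_x, W_h, b)  # h carries the short-term memory forward
print(h.shape)  # (3,)
```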

In the LSTM RNN we add a new component on top of this: the long-term memory cell, also called the cell state.

[Figure: LSTM cell with long-term (cell state) and short-term memory]

In the figure above we can see that a new cell state has been added, which acts as the long-term memory. The cell state is kind of like a conveyor belt: it runs straight down the entire chain, with only some minor linear interactions, so it is very easy for information to flow along it unchanged.

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation.

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!"
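
As a tiny illustration of this gating idea (a sketch of my own, not code from the post), a sigmoid output can be multiplied pointwise with a vector to let some components through and block others:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

values = np.array([2.0, -1.0, 0.5])            # information flowing through the cell
gate = sigmoid(np.array([10.0, -10.0, 0.0]))   # roughly 1.0, 0.0, 0.5
print(gate * values)  # first component passes, second is blocked, third is halved
```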

Now let's explore how the long-term memory cell works, step by step (a code sketch of the full cell follows this list):

  • Step 1: Forget Gate

    [Figure: forget gate]

    [Picture Source](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
    
    
    • In this step we decide which information to keep and which to forget. This is done by a sigmoid layer called the forget gate: f_t = σ(W_f · [h_{t−1}, x_t] + b_f). It looks at h_{t−1} and x_t and outputs a number between 0 and 1 for each number in the cell state C_{t−1}. A 1 represents "completely keep this" while a 0 represents "completely get rid of this."
    • For example, in a language-modeling task that predicts the next word, the forget gate decides which earlier information to keep and which to forget.
  • Step 2: Input Gate

    [Figure: input gate] (Picture Source)

    • In this step we decide which new information to store in the long-term memory. This is done with two functions, a sigmoid and a tanh, whose equations are shown in the figure above.
    • Both take a weighted combination of the input and the previous hidden state: the sigmoid layer gives the input gate i_t = σ(W_i · [h_{t−1}, x_t] + b_i), and the tanh layer gives the candidate values C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C).
    • We then multiply them pointwise, i_t ∗ C̃_t.
    • These are the new candidate values, scaled by how much we decided to update each state value. In the case of the language model, this is where we'd actually drop the information about the old subject's gender and add the new information, as we decided in the previous steps.
    • Finally we add the previous cell state multiplied by f_t and the new candidate values multiplied by i_t, which stores the required information: C_t = f_t ∗ C_{t−1} + i_t ∗ C̃_t.

      [Figure: cell state update] (Picture Source)

  • Step 3: Output Gate

    [Figure: output gate] (Picture Source)

    • In this step we compute the desired output. First, we run a sigmoid layer which decides what parts of the cell state we're going to output: o_t = σ(W_o · [h_{t−1}, x_t] + b_o). Then we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, h_t = o_t ∗ tanh(C_t), so that we only output the parts we decided to.
    • For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
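
Putting the three steps together, here is a minimal NumPy sketch of one LSTM cell update following the equations above; the weight names and dimensions are illustrative assumptions, not values from the post.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step: forget gate, input gate, cell-state update, output gate."""
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])         # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])         # input gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])     # candidate values
    C_t = f_t * C_prev + i_t * C_tilde         # new cell state (long-term memory)
    o_t = sigmoid(W["o"] @ z + b["o"])         # output gate
    h_t = o_t * np.tanh(C_t)                   # new hidden state (short-term memory)
    return h_t, C_t

# Illustrative sizes: 4-dimensional input, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
n_in, n_hid = 4, 3
W = {k: rng.normal(size=(n_hid, n_hid + n_in)) for k in "fiCo"}
b = {k: np.zeros(n_hid) for k in "fiCo"}

h, C = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(5, n_in)):           # a sequence of 5 inputs
    h, C = lstm_step(x, h, C, W, b)
print(h.shape, C.shape)  # (3,) (3,)
```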

Bidirectional LSTM RNN

Rather than encoding the sequence in the forward direction only, we encode it in the backward direction as well, and concatenate the results from both the forward and backward directions at each timestep.

The encoded representation of each word then captures information about the words both before and after it.

One important point: a bidirectional LSTM can only be used for encoding a sequence. It cannot be used to generate words, because at test time the future words needed by the backward LSTM are not yet available.

[Figure: bidirectional LSTM]

We pass the sentence "I will swim today." through the network in the forward direction as well as the backward direction and concatenate the results, which gives the encoded representation.
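
As a sketch of this idea in PyTorch (assuming `torch` is installed; the embedding size, hidden size, and toy tokenization below are my own illustrative choices), `nn.LSTM` with `bidirectional=True` runs both directions and concatenates the hidden states at each timestep:

```python
import torch
import torch.nn as nn

# Toy vocabulary and embedding for the sentence "I will swim today ."
tokens = ["I", "will", "swim", "today", "."]
vocab = {w: i for i, w in enumerate(tokens)}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

# Bidirectional LSTM: hidden_size=16 per direction, concatenated to 32 per timestep.
bilstm = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

ids = torch.tensor([[vocab[w] for w in tokens]])  # shape (1, 5)
x = embed(ids)                                    # shape (1, 5, 8)
outputs, _ = bilstm(x)                            # shape (1, 5, 32)

# Each timestep's vector is [forward hidden state ; backward hidden state],
# so every word's encoding sees both its left and right context.
print(outputs.shape)  # torch.Size([1, 5, 32])
```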

Summary

Many of the remarkable results in natural language processing are due to the LSTM RNN, which has long been one of its main building blocks.