
Understanding LSTMs

In my last blog we discussed the shortcomings of RNNs: the vanishing gradient problem, which prevents them from learning longer sequences and leaves them with only short-term memory.


LSTMs and GRUs were designed as a solution to this short-term memory problem. Let's walk through how they work.

Both use a mechanism of gates to manage the flow of information.

These gates decide which parts of a sequence to keep and which to throw away, so the network stores the relevant information and forgets what is not required.


Almost all state-of-the-art results with recurrent networks are achieved with LSTMs or GRUs.


These models are used in speech recognition, speech synthesis, text generation, and even in generating relevant captions for images and videos.


To understand this, let's take the example of a movie review:

"Amazing movie! Movie was full of new surprises, audience had to hold back till the last minute to watch it. There was hardly any boring moment. "

Only certain keywords from the review need to be remembered to judge whether it was good or bad; we don't have to memorize the whole review to analyse it.

In the review above, only words like "amazing" and "surprises" would be remembered, and the rest can be forgotten while still making a relevant prediction.

This is essentially what an LSTM or GRU does: keep the relevant information and forget the irrelevant, non-scoring parts.


In an RNN, each cell receives the current word as a machine-readable vector together with the hidden state from the previous cell; these two vectors are combined and fed into a tanh activation function.

At each step the previous hidden state is carried forward, so the network keeps a running summary of the history it has seen.

The tanh activation regulates the values flowing through the network, keeping them between -1 and 1 so that they do not explode after the many mathematical transformations applied at each step.

Keeping the values bounded in this way also keeps the computation stable and far cheaper than letting them grow unchecked.
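
To make this concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. The names (rnn_step, W_xh, W_hh, b_h) and the vector sizes are illustrative choices for this sketch, not taken from any particular library.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: combine the current input vector with the
    previous hidden state and squash the result with tanh."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Illustrative sizes: 8-dimensional word vectors, 16-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((16, 8)) * 0.1
W_hh = rng.standard_normal((16, 16)) * 0.1
b_h = np.zeros(16)

h = np.zeros(16)
for x_t in rng.standard_normal((5, 8)):   # a toy "sentence" of 5 word vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

print(h.min(), h.max())  # every value stays inside (-1, 1) thanks to tanh
```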


An LSTM, like an RNN, passes information forward sequentially, taking the hidden state and the word vectors as inputs.

It also gets an extra input, the cell state, which regulates its memory.









The cell state is maintained from the first step to the last, which is what gives the network its long-term memory. Along the way, information is added to or removed from the cell state depending on the gates. The gates are responsible for keeping the relevant information and forgetting the rest; let's see how they do that.


Gates use a sigmoid activation function, which squashes values into the range 0 to 1 instead of -1 to 1. Any value multiplied by something close to 1 stays roughly the same, so that information is remembered, while anything multiplied by something close to 0 becomes 0 and is forgotten.

The network thus keeps the relevant information and forgets the unimportant parts.
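
A tiny NumPy sketch of this gating idea (all numbers are made up for illustration): components multiplied by a sigmoid output near 1 pass through, while components multiplied by a value near 0 are wiped out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

info = np.array([0.9, -2.5, 1.7])            # values carried by the network
gate = sigmoid(np.array([6.0, -6.0, 0.0]))   # roughly [1, 0, 0.5]

print(gate)         # ~[0.998, 0.002, 0.5]
print(info * gate)  # first value kept, second almost erased, third damped
```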


Now let's look at each gate:

1) In the forget gate, the input word vector and the hidden state vector from the previous cell are combined and passed through the sigmoid function. The output values fall between 0 and 1: values near 0 mean forget, values near 1 mean keep.




f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
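
As a sketch of this formula (the names forget_gate, W_f and b_f are just illustrative), the forget gate could be written as:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x_t, h_prev, W_f, b_f):
    """f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f): values near 1 keep the
    corresponding cell-state entries, values near 0 forget them."""
    combined = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    return sigmoid(W_f @ combined + b_f)
```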


2) To update the cell state we have the input gate. The combined input and hidden state vectors are passed through a sigmoid function to decide which values to keep and which to forget.

The same combined input is also passed through a tanh function to squash the candidate values into the range -1 to 1.

The sigmoid output is then multiplied element-wise with the tanh output: the sigmoid values decide which of the tanh candidate values are allowed through.







i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)

~C_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
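
A small sketch of these two formulas (again, the function and weight names are illustrative), reusing the same sigmoid helper as above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(x_t, h_prev, W_i, b_i, W_c, b_c):
    """Returns i_t (which candidate values to admit) and ~C_t
    (the tanh-regularised candidate values themselves)."""
    combined = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t]
    i_t = sigmoid(W_i @ combined + b_i)
    c_tilde = np.tanh(W_c @ combined + b_c)
    return i_t, c_tilde
```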


3) The soul of the LSTM lies in the cell state. It is a bit like a conveyor belt: it runs straight from cell to cell with only minor changes, a multiplication and an addition using the gate outputs, so it is very easy for information to flow along it unchanged. First, the previous cell state is multiplied element-wise by the forget gate output.

Then the result of the input gate is added to it. This becomes the new cell state, which is passed on to the next cell.



C_t = f_t * C_{t-1} + i_t * ~C_t
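
The update itself is just element-wise arithmetic, as in this short sketch (the helper name is illustrative):

```python
def cell_state_update(c_prev, f_t, i_t, c_tilde):
    """C_t = f_t * C_{t-1} + i_t * ~C_t: scale down the old memory with the
    forget gate, then add the newly admitted candidate values."""
    return f_t * c_prev + i_t * c_tilde
```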



4) Finally, the output gate computes the hidden state for the next cell. The previous hidden state and the input vector are combined and passed through a sigmoid function.

The current cell state is passed through a tanh function to squash its values again, and the two results are multiplied to decide which values make it into the new hidden state, which is carried forward to the next cell.

The new cell state and hidden state are then handed to the next cell for processing.



o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)

h_t = o_t * tanh(C_t)
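
Putting the four pieces together, a minimal single LSTM step might look like the sketch below. All weight names, the dict layout, and the vector sizes are illustrative assumptions; real frameworks pack these matrices differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W and b are dicts of per-gate weights and biases
    ('f', 'i', 'c', 'o'), each acting on the combined vector [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ z + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ z + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ z + b['c'])    # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde        # new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])        # output gate
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

# Toy run over a "review" of 5 word vectors (sizes are illustrative).
rng = np.random.default_rng(1)
hidden, inp = 16, 8
W = {k: rng.standard_normal((hidden, hidden + inp)) * 0.1 for k in "fico"}
b = {k: np.zeros(hidden) for k in "fico"}
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.standard_normal((5, inp)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (16,) (16,)
```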


After the LSTM has filtered it, the review effectively reduces to only the important information, something like:


"Amazing ... surprises ... hardly any boring moment ..."





LSTMs handle long sequences far better than plain RNNs and do not suffer from the same short-term memory problem.

GRUs were later developed on the same intuition but with a simpler structure for faster processing. In practice, data scientists use either one, or a combination of both, whichever gives better performance.

My next blog will cover the intuition behind GRUs in more detail.

Thanks for reading!
