Transformers: An In-Depth Understanding of How They Work

Language processing has come a long way: from Bag of Words, to Recurrent Neural Networks, to Long Short-Term Memories, each step overcoming the problems of the one before it.

Bag of Words was a kind of sparse representation: if we have a vocabulary of 10 million words, each word is represented by a sparse vector that is mostly zeroes, with a one at the index of that word.
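
As a toy illustration of the idea (with a six-word vocabulary instead of 10 million), a sketch of one-hot and bag-of-words vectors might look like this; the vocabulary and sentence are made up for illustration:

```python
# Minimal sketch of the one-hot / bag-of-words idea with a toy vocabulary.
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeroes with a single one at the word's index.
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

def bag_of_words(sentence):
    # Sum of one-hot vectors: counts per vocabulary index, word order is lost.
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[word_to_index[word]] += 1
    return vec

print(one_hot("cat"))                          # [0, 1, 0, 0, 0, 0]
print(bag_of_words("the cat sat on the mat"))  # [2, 1, 1, 1, 1, 0]
```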

RNNs were good at handling sequences of words, but they suffered from vanishing and exploding gradients: they kept short-range sequence information well, but not very long-range dependencies. Also, vanilla models had no context from words coming later in the sequence.

To resolve that, bidirectional recurrent neural networks were introduced, which take as input the hidden states of both previous and subsequent words.

LSTMs came to the rescue for the RNN problem of short-term memory. They have a more complex cell state in which important information is retained, so long sequences can still make sense. However, this model too is not without disadvantages: it is slow to train, it has very long gradient paths (a 100-word sequence effectively means a 100-layer network), it needs labelled data for the task at hand, and transfer learning does not work well on it.


Transformers have taken Natural Language Processing by storm.

They are pushing boundaries and are being used in virtually every NLP task, including question-answering chatbots, translation models, text generation, and even search engines.

They are all the rage: a pre-trained transformer does not need to be trained from scratch for each new task, making it a good example of transfer learning.

Now how do they work?

Transformers outperform RNNs, GRUs, and LSTMs, and they do not rely on a chain structure to keep the sequence intact.

BERT and GPT are examples of transformer-based models.

To understand transformers, we must first know about the "Attention" mechanism.


In a text generation model, the transformer is fed an input and also has knowledge of the previous words, based on which it predicts the words ahead.

Through the attention mechanism, each word knows which of the previously said words it should pay attention to.

The model learns by backpropagation; in a recurrent neural network, the window over which gradients can propagate backwards is much smaller than in a transformer.

LSTMs also have this kind of memory mechanism, but transformers can, in principle, attend over an unlimited window of memory if given enough resources.

Transformers are attention-based encoder-decoder models: the encoder is where the inputs are processed and turned into a continuous representation that holds all the learned information about the input.



The decoder then takes that representation from the encoder and decodes it step by step to produce the output, while also being fed the previous outputs.

Encoder


Step 1: Word Embedding

In the encoder, the input first goes through a word embedding layer; it is then combined with a positional encoding, which keeps the information about positions.

Since we are not using RNNs here, the sequential information is kept in the positional encoding, which is a very smart way of retaining the positions of the text.
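
A word embedding layer is essentially a lookup table that maps each vocabulary index to a dense vector. A minimal sketch, with random weights standing in for the learned ones:

```python
import numpy as np

# Minimal sketch of a word embedding layer: a lookup table mapping each
# vocabulary index to a dense vector. The weights are random stand-ins
# for what would normally be learned during training.
vocab_size, d_model = 6, 16
embedding_table = np.random.randn(vocab_size, d_model)

token_ids = [0, 1, 2]                          # e.g. "the cat sat" as vocabulary indices
word_embeddings = embedding_table[token_ids]   # one row per token
print(word_embeddings.shape)                   # (3, 16)
```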

Step 2: Positional Encoding

Positional encoding is done using sine and cosine functions, represented by

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position of the word in the sequence, i is the dimension index, and d_model is the embedding size.


These positional encodings are added to their word embedding vectors to get the position-aware input encodings.

Sine and cosine functions have linear properties the model can easily learn: the encoding at position pos + k can be expressed as a linear function of the encoding at position pos, which makes relative positions easy to pick up.
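
To make this concrete, here is a minimal NumPy sketch of these sine and cosine encodings, following the formulas above; the sequence length and model dimension are arbitrary toy values, and the random embeddings stand in for a real learned embedding layer.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]   # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# The encodings are simply added to the word embeddings.
seq_len, d_model = 10, 16
word_embeddings = np.random.randn(seq_len, d_model)  # stand-in embeddings
encoder_input = word_embeddings + positional_encoding(seq_len, d_model)
print(encoder_input.shape)                           # (10, 16)
```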


Step 3: Multi-headed Attention

The input is fed into three fully connected layers to form the query, key, and value vectors.

a) Self Attention: to make query, key and value vectors

An example of a query is the search text you type on YouTube or Google; the keys are the video or article titles associated with that query text.

The queries and keys are combined via dot products to produce scores; the highest scores go to the words that should be given the most attention in the search.



In short, multi-headed attention produces a vector representation of the input sequence that captures how important each word is, i.e. how much attention it deserves, and how each word attends to all the other words in the sequence.




The scores are scaled down by dividing them by the square root of the dimension of the query and key vectors.

This gives more stable gradients, since multiplying values together can cause an exploding gradient problem. The results are called scaled scores.

Softmax is then applied to the scaled scores to get a probability between 0 and 1 for each word; higher-probability words get more attention, while lower values are effectively ignored.

Softmax(x)_i = exp(x_i) / ∑_j exp(x_j)
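
As a small worked example, here is the softmax formula in code, with the usual trick of subtracting the maximum for numerical stability (an implementation detail, not something required by the formula itself):

```python
import numpy as np

def softmax(x):
    # Subtracting the max does not change the result but avoids overflow in exp.
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

scaled_scores = np.array([2.0, 1.0, 0.1])
print(softmax(scaled_scores))  # probabilities between 0 and 1 that sum to 1
```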


These attention weights are then multiplied by the value vectors to get the output.

The higher softmax scores keep the values that deserve more attention high, while lower scores mark values as effectively irrelevant.

The output vectors are fed into a linear layer to be processed.

If we have two self-attention heads, we get two output vectors. Both are concatenated and given to the linear layer for processing.

In theory each head learns something different, so multiple heads are combined into the final output from the linear layer; this simply gives the model more representational power.
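
To tie Step 3 together, here is a minimal NumPy sketch of multi-headed self-attention: three linear projections into queries, keys and values per head, scaled dot-product scores, softmax, a weighted sum of the values, and concatenation of the heads through a final linear layer. The weight matrices here are random stand-ins, not trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exps / np.sum(exps, axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Three fully connected (linear) projections: query, key, value.
        w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        # Dot-product scores, scaled by the square root of the query/key dimension.
        scores = q @ k.T / np.sqrt(d_head)
        weights = softmax(scores, axis=-1)   # attention probabilities per word
        heads.append(weights @ v)            # weighted sum of the value vectors
    # Concatenate all heads and mix them with a final linear layer.
    concat = np.concatenate(heads, axis=-1)
    w_o = rng.standard_normal((d_model, d_model))
    return concat @ w_o

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))             # 5 words, d_model = 16
out = multi_head_self_attention(x, num_heads=2, rng=rng)
print(out.shape)                             # (5, 16)
```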

Step 4: Residual Connections, Layer Normalization and Feed Forward

The multi-headed attention output is added to the original input; this is called a residual connection.

The output of the residual connection is passed to layer normalization.

The layer normalization output is then fed into a feed-forward network for further processing. The feed-forward network consists of a couple of linear layers with a ReLU activation in between.


Its output is again added to the input of the feed-forward network (another residual connection) and then passed through another layer normalization.

The residual connections help the network train by letting gradients flow directly through the network, even over long paths.

Layer normalization is used to stabilize the results, substantially reducing the training time needed.

The point-wise feed-forward network is used to further process the attention output, giving it a richer weighted representation.

All of this encodes the input into a continuous representation carrying attention information. This helps the decoder focus on the important words during decoding.
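
Here is a minimal sketch of Step 4's sub-layers, assuming the multi-headed attention output has already been computed (a random array stands in for it): residual connection, layer normalization, and the point-wise feed-forward network with a ReLU in between.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, w1, b1, w2, b2):
    # Two linear layers with a ReLU in between, applied point-wise to each position.
    return np.maximum(0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 16, 64
x = rng.standard_normal((seq_len, d_model))               # encoder input
attention_out = rng.standard_normal((seq_len, d_model))   # stand-in for the attention output

# Residual connection + layer normalization around the attention sub-layer.
x = layer_norm(x + attention_out)

# Point-wise feed-forward sub-layer, again with a residual connection + layer norm.
w1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
x = layer_norm(x + feed_forward(x, w1, b1, w2, b2))
print(x.shape)  # (5, 16)
```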

Encoder layers can be stacked N times to encode the information further, with each layer able to learn different attention representations. This can powerfully boost the predictive power of the transformer.

Decoder layer

The decoder's job is to generate text sequences, and it has sub-layers similar to the encoder's: two multi-headed attention layers and a point-wise feed-forward layer, each with residual connections and layer normalization.

Decoder Multi-Headed Attention

These layers behave similarly to those of the encoder but have a different job. The decoder is topped with a linear layer and a softmax to calculate probabilities as outputs.

The decoder takes its previous outputs as inputs, as well as the encoder outputs, which contain the attention information.

The decoder stops decoding when it generates an end token as output.
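
Here is a rough sketch of that decoding loop. The decoder_step function is a hypothetical stand-in for a trained transformer decoder; a real one would run masked self-attention over the tokens generated so far, attend to the encoder output, and pick the next token from its softmax probabilities.

```python
# Sketch of the decoding loop: the decoder is repeatedly fed the encoder output
# plus everything it has generated so far, and stops at the end token.
END_TOKEN = "<end>"

def decoder_step(encoder_output, generated):
    # Hypothetical stand-in for a trained decoder. Here we just replay a
    # canned sequence to show the control flow.
    canned = ["hello", "world", END_TOKEN]
    return canned[len(generated) - 1]

def decode(encoder_output, max_len=50):
    generated = ["<start>"]                 # the decoder is seeded with a start token
    for _ in range(max_len):
        next_token = decoder_step(encoder_output, generated)
        if next_token == END_TOKEN:         # stop once the end token is produced
            break
        generated.append(next_token)        # feed previous outputs back in
    return generated[1:]                    # drop the start token

print(decode(encoder_output=None))          # ['hello', 'world']
```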