Transformer Networks
In this post, I will describe some of the experiments that I did on transformer networks, starting with the basics and moving towards more advanced concepts. The code is written in Python using PyTorch. The complete Google Colab notebook is available at the end of the post.
I first came across Transformer networks through a class I took at UC Berkeley, CS189: Machine Learning. Described in detail in the paper “Attention Is All You Need”, the transformer architecture surpasses the previous sequence-modeling approaches built on Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) models.
Encoder/Decoder
There are two segments to building the transformer: the Encoder stack and the Decoder stack. The Encoder comprises six identical layers, and each layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a “position-wise” fully connected feed-forward network. There is also a residual connection around each of the two sub-layers, followed by layer normalization. Residual connections are important to prevent vanishing gradients.
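To make this concrete, here is a minimal sketch of one encoder layer in PyTorch, using the paper’s sizes (model dimension 512, 8 heads, inner feed-forward size 2048) and PyTorch’s built-in nn.MultiheadAttention. The class name EncoderLayer and the omission of dropout are my own simplifications, not the notebook’s exact code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Sketch of one encoder layer: self-attention and a feed-forward network,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention + residual connection + layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward network + residual connection + layer norm
        x = self.norm2(x + self.ff(x))
        return x
```

The full encoder simply stacks six of these layers on top of each other.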
The decoder, like the encoder, comprises six identical layers, but each layer has three sub-layers. As in the encoder, there are residual connections around each of the sub-layers, followed by layer normalization. The self-attention sub-layer in the decoder stack is modified to “prevent positions from attending to subsequent positions”. According to the paper, this masking, combined with the output embeddings being offset by one position, ensures that the predictions for position ‘i’ can depend only on the outputs at positions less than ‘i’.
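As a small illustration of the masking idea, here is one common way to build a “no peeking ahead” mask in PyTorch (the function name subsequent_mask is my own). A True entry marks a position a query is not allowed to attend to, which matches the boolean attn_mask convention of nn.MultiheadAttention.

```python
import torch

def subsequent_mask(size):
    """Boolean mask that blocks attention to future positions:
    True at (i, j) means position i may NOT attend to position j (j > i)."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(subsequent_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```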
Attention
Attention is a super cool idea. It might seem daunting at first, but it is quite a simple mechanism once you get the intuition behind it. It involves nothing beyond college-level introductory linear algebra.
Consider the input to be a three-word sentence: “Howdy fellow traveler”
The sentence is first tokenized, that is, broken into words: [“Howdy”, “fellow”, “traveler”]
Each word is then converted to a 512-dimensional vector representation. The diagram above shows each word as a 512-dimensional vector (the green squares). Each of these vectors is then transformed by three matrices, the query matrix, the key matrix, and the value matrix, to output three 64-dimensional vectors. Each of these matrices has dimension 512 × 64.
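To make the shapes concrete, here is a rough sketch of this projection step for the three-word example. The embeddings and the weight matrices are randomly initialized placeholders for what would normally be learned parameters.

```python
import torch

d_model, d_k = 512, 64
torch.manual_seed(0)

# One 512-dimensional embedding per token in "Howdy fellow traveler"
embeddings = torch.randn(3, d_model)

# The three 512 x 64 projection matrices (learned in practice, random here)
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

queries = embeddings @ W_q   # shape (3, 64)
keys    = embeddings @ W_k   # shape (3, 64)
values  = embeddings @ W_v   # shape (3, 64)
print(queries.shape, keys.shape, values.shape)
```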
We then calculate the attention scores: a plain dot product between the query vector and the key vectors of the other words. This can be batched as a matrix-vector product by stacking all the key vectors of the words into a matrix. We also take the dot product between the query vector and the key vector of the current word itself, which is what makes this self-attention.
By doing these manipulations we can see the attention scores, or how each of the words is related to the others in a specific context. Finally, we push these scores through a softmax and use the softmax output to weight the value vectors. This is what attention does in a nutshell.
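Continuing the sketch above (reusing queries, keys, values, and d_k from the previous snippet), the scoring and weighting steps might look like this. Note that the paper additionally divides the scores by the square root of the key dimension, which I include here.

```python
import math
import torch

# scores[i, j] is the dot product of query i with key j; the diagonal is the
# self-attention term where a word attends to itself
scores = queries @ keys.T / math.sqrt(d_k)   # the paper scales by sqrt(d_k)

# Softmax over each row turns the scores into attention weights
weights = torch.softmax(scores, dim=-1)      # shape (3, 3), each row sums to 1

# Each output vector is a weighted sum of the value vectors
output = weights @ values                    # shape (3, 64)
```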
There is also the concept of multi-headed attention, which applies the exact same attention mechanism multiple times with different query, key, and value matrices. The intuition is that different attention heads capture different relations: tense, nouns/pronouns, and so on.
With multi-headed attention, however, we end up with several different vectors, but we want a single vector in the end, so we concatenate the outputs from the different attention heads and transform the result with another matrix, Wo, to obtain one vector per word.
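As a rough sketch of that last step, assuming 8 heads of size 64 each (so the concatenated vector is 8 × 64 = 512 wide) and a random placeholder for Wo:

```python
import torch

num_heads, d_k, d_model = 8, 64, 512
seq_len = 3
torch.manual_seed(0)

# Pretend each head has already produced its own 64-dimensional output per word
head_outputs = [torch.randn(seq_len, d_k) for _ in range(num_heads)]

# Concatenate along the feature dimension: (3, 8 * 64) = (3, 512)
concat = torch.cat(head_outputs, dim=-1)

# Project back to a single d_model-dimensional vector per word with Wo
W_o = torch.randn(num_heads * d_k, d_model)
multi_head_out = concat @ W_o                # shape (3, 512)
print(multi_head_out.shape)
```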
A more detailed presentation can be found at this link: Click Here
A complete Encoder-Decoder Transformer architecture code walkthrough: Click Here