
The A-Z of Transformers: Everything You Need to Know

Everything you need to know about Transformers, and how to implement them

Image by author

Why another tutorial on Transformers?

You have probably already heard of Transformers, and everyone talks about them, so why write yet another article on the subject?

Well, I am a researcher, and my work requires a very deep understanding of the tools I use (after all, if you don’t understand a tool, how can you identify where it fails and how to improve it?).

As I ventured deeper into the world of Transformers, I found myself buried under a mountain of resources. And yet, despite all that reading, I was left with a general sense of the architecture and a trail of lingering questions.

In this guide, I aim to bridge that knowledge gap: a strong intuition for Transformers, a deep dive into the architecture, and an implementation from scratch.

I strongly advise you to follow the code on Github:

awesome-ai-tutorials/NLP/007 – Transformers From Scratch at main · FrancoisPorcher/awesome-ai-tutorials

Enjoy! 🤗

A little bit of History first:

Many attribute the concept of the attention mechanism to the renowned paper “Attention is All You Need” by the Google Brain team. However, this is only part of the story.

The roots of the attention mechanism can be traced back to an earlier paper titled “Neural Machine Translation by Jointly Learning to Align and Translate” authored by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio.

Bahdanau’s primary challenge was addressing the limitations of Recurrent Neural Networks (RNNs). Specifically, when encoding lengthy sentences into vectors using RNNs, crucial information was often lost.

Drawing parallels from translation exercises — where one often revisits the source sentence while translating — Bahdanau aimed to allocate weights to the hidden states within the RNN. This approach yielded impressive outcomes, and is depicted in the following diagram.

Image from Neural machine translation by jointly learning to align and translate

However, Bahdanau wasn’t the only one tackling this issue. Taking cues from his groundbreaking work, the Google Brain team posited a bold idea:

“Why not strip everything down and focus solely on the attention mechanism?”

They believed it wasn’t the RNN but the attention mechanism that was the primary driver behind the success.

This conviction culminated in their paper, aptly titled “Attention is All You Need”.

Fascinating, right?

The Transformer Architecture

1. First things first, Embeddings

This diagram represents the Transformer architecture. Don’t worry if you don’t understand anything at first, we will cover absolutely everything.

Embeddings, Image from article modified by author

From Text to Vectors — The Embedding Process: Imagine our input is a sequence of words, say “The cat drinks milk”. This sequence has a length termed seq_len. Our immediate task is to convert these words into a form that the model can understand, specifically vectors. That’s where the Embedder comes in.

Each word undergoes a transformation to become a vector. This transformation process is called ‘embedding’. Each of these vectors or ‘embeddings’ has a size of d_model = 512.

Now, what exactly is this Embedder? At its core, the Embedder is a linear mapping (matrix), denoted by E. You can visualize it as a matrix of size (d_model, vocab_size), where vocab_size is the size of our vocabulary. In practice, it is implemented as a lookup table (nn.Embedding) that maps each token id to its d_model-dimensional vector.

After the embedding process, we end up with a collection of vectors of size d_model each. It’s crucial to understand this format, as it’s a recurrent theme — you’ll see it across various stages like encoder input, encoder output, and so on.

Let’s code this part:

import math

import torch
import torch.nn as nn


class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super(Embeddings, self).__init__()
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        # Look up the embedding of each token id and scale by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)

Note: we multiply by sqrt(d_model) for normalization purposes (explained later)

Note 2: I personally wondered whether we use a pre-trained embedder, or at least start from a pre-trained one and fine-tune it. But no: the embedding is learned entirely from scratch and initialized randomly.
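As a quick sanity check, here is what the shapes look like (the vocabulary size of 1,000 and the token ids below are made up purely for illustration):

toy_embed = Embeddings(d_model=512, vocab=1000)
tokens = torch.tensor([[5, 42, 7, 99]])   # (batch_size=1, seq_len=4) of token ids
print(toy_embed(tokens).shape)            # torch.Size([1, 4, 512])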

Positional Encoding

Why Do We Need Positional Encoding?

Given our current setup, we possess a list of vectors representing words. If fed as-is to a transformer model, there’s a key element missing: the sequential order of words. Words in natural languages often derive meaning from their position. “John loves Mary” carries a different meaning from “Mary loves John.” To ensure our model captures this order, we introduce Positional Encoding.

Now, you might wonder, “Why not just add a simple increment like +1 for the first word, +2 for the second, and so on?” There are several challenges with this approach:

Multidimensionality: Each token is represented in 512 dimensions. A mere scalar increment would not suffice to capture this complex space.

Normalization concerns: Ideally, we want our values to lie between -1 and 1, so directly adding large numbers (like +2000 for a long text) would be problematic.

Sequence-length dependency: Direct increments are not scale-agnostic. For a long text, where the position might be +5000, the number does not truly reflect the relative position of the token within its sentence. And the meaning of a word depends more on its relative position in a sentence than on its absolute position in a text.

If you studied mathematics, the idea of circular coordinates — specifically, sine and cosine functions — should resonate with your intuition. These functions provide a unique way to encode position that meets our needs.

Given our matrix of size (seq_len, d_model), our aim is to add another matrix, the Positional Encoding, of the same size.

Here’s the core concept:

For every token, the authors assign a sine value to the even dimensions (2k) and a cosine value to the odd dimensions (2k+1).

If we fix the token position and move along the dimensions, we can see that the sine/cosine waves decrease in frequency.

If we look at a token that is further along in the text, this phenomenon happens more rapidly (the frequency is increased).

Image from article

This is summed up in the following graph (but don’t scratch your head too much on this). The key takeaway is that Positional Encoding is a mathematical function that allows the Transformer to keep an idea of the order of tokens in the sentence. This is a very active area of research.
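For reference, the exact functions proposed in the paper are:

PE(pos, 2k) = sin(pos / 10000^(2k / d_model))
PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))

where pos is the position of the token in the sequence, and 2k / 2k+1 index the even / odd dimensions of the embedding.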

Positional Embedding, Image by author

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
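A quick shape check: positional encodings are simply added to the embedded tokens, so the shape is unchanged (the dropout value below is just an example):

pos_enc = PositionalEncoding(d_model=512, dropout=0.1)
x = torch.randn(1, 4, 512)     # a toy batch of 4 already-embedded tokens
print(pos_enc(x).shape)        # torch.Size([1, 4, 512]): same shape, position information injected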

The Attention Mechanism (Single Head)

Let’s dive into the core concept of Google’s paper: the Attention Mechanism

High-Level Intuition:

At its core, the attention mechanism is a communication mechanism between vectors/tokens. It allows a model to focus on specific parts of the input when producing an output. Think of it as shining a spotlight on certain parts of your input data. This “spotlight” can be brighter on more relevant parts (giving them more attention) and dimmer on less relevant parts.

For a sentence, attention helps determine the relationship between words. Some words are closely related to each other in meaning or function within a sentence, while others are not. The attention mechanism quantifies these relationships.

Example:

Consider the sentence: “She gave him her book.”

If we focus on the word “her”, the attention mechanism might determine that:

It has a strong connection with “book” because “her” is indicating possession of the “book”.

It has a medium connection with “She” because “She” and “her” likely refer to the same entity.

It has a weaker connection with other words like “gave” or “him”.

Technical Dive into the Attention mechanism

Scaled Dot-Product Attention, image from article

For each token, we generate three vectors:

1. Query (Q):

Intuition: Think of the query as a “question” that a token poses. It represents the current word and tries to find out which parts of the sequence are relevant to it.

2. Key (K):

Intuition: The key can be thought of as an “identifier” for each word in the sequence. When the query “asks” its question, the key helps in “answering” by determining how relevant each word in the sequence is to the query.

3. Value (V):

Intuition: Once the relevance of each word (via its key) to the query is determined, we need actual information or content from those words to assist the current token. This is where the value comes in. It represents the content of each word.

How are Q, K, V generated?

Q, K, V generation, image by author

The similarity between a query and a key is their dot product (which measures how aligned two vectors are), divided by sqrt(d_k). This is roughly the standard deviation of the dot product of random unit-variance vectors, so dividing by it keeps the scores normalized.

Attention formula, Image from article
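Concretely, Q, K and V are all obtained from the same input X (the embedded tokens) through three learned linear projections. Here is a minimal sketch; the separate W_q, W_k, W_v modules are just for illustration, since in the actual implementation further below these projections live inside the multi-head attention class:

d_model = 512
x = torch.randn(1, 4, d_model)     # (batch, seq_len, d_model): the embedded tokens

W_q = nn.Linear(d_model, d_model)  # learned projection for the Queries
W_k = nn.Linear(d_model, d_model)  # learned projection for the Keys
W_v = nn.Linear(d_model, d_model)  # learned projection for the Values

Q, K, V = W_q(x), W_k(x), W_v(x)   # each of shape (1, 4, 512)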

Let’s illustrate this with an example:

Let’s imagine we have one query, and want to figure out the result of the attention with K and V:

Q, K, V, Image by author

Now let’s compute the similarities between q1 and the keys:

Dot Product, Image by author

While the numbers 3/2 and 1/8 might seem relatively close, the softmax function’s exponential nature would amplify their difference.

Attention weights, Image by author

This differential suggests that q1 has a more pronounced connection to k1 than k2.
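To make this concrete, feeding the two scores from the example through a softmax gives roughly an 80/20 split:

scores = torch.tensor([1.5, 0.125])    # the similarity scores 3/2 and 1/8 from above
print(torch.softmax(scores, dim=-1))   # ≈ tensor([0.80, 0.20]): k1 receives about 80% of the weight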

Now let’s look at the result of attention, which is a weighted (attention weights) combination of the values

Attention, Image by author

Great! Repeating this operation for every token (q1 through qn) yields a collection of n vectors.

In practice this operation is vectorized into a matrix multiplication for efficiency.

Let’s code it:

def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
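A quick shape check with random toy tensors (single head, dimensions made up for illustration):

q = torch.randn(1, 4, 64)      # (batch, seq_len, d_k)
k = torch.randn(1, 4, 64)
v = torch.randn(1, 4, 64)

out, weights = attention(q, k, v)
print(out.shape)               # torch.Size([1, 4, 64]): one output vector per token
print(weights.shape)           # torch.Size([1, 4, 4]): one attention weight per (query, key) pair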

Multi-Headed Attention

What’s the Issue with Single-Headed Attention?

With the single-headed attention approach, every token gets to pose just one query. This generally means it ends up with a strong relationship to just one other token, since the softmax tends to heavily weigh one value while pushing the others close to zero. Yet, when you think about language and sentence structures, a single word often has connections to multiple other words, not just one.

To tackle this limitation, we introduce multi-headed attention. The core idea? Let’s allow each token to pose multiple questions (queries) simultaneously by running the attention process h times in parallel. The original Transformer uses h = 8 heads.

Multi-Headed attention, image from article

Once we get the results of the 8 heads, we concatenate them into a single matrix and apply a final linear projection to map it back to d_model.

Multi-Headed attention, image from article

This is also straightforward to code; we just have to be careful with the dimensions:

class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)

        # 1) Do all the linear projections in batch from d_model => h x d_k
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]

        # 2) Apply attention on all the projected vectors in batch.
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)

        # 3) "Concat" using a view and apply a final linear.
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        del query
        del key
        del value
        return self.linears[-1](x)
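A quick shape check (note that this class relies on the clones helper, which is defined a bit further down in the Encoder section, so run it after that definition):

mha = MultiHeadedAttention(h=8, d_model=512)
x = torch.randn(1, 4, 512)     # (batch, seq_len, d_model)
print(mha(x, x, x).shape)      # torch.Size([1, 4, 512]): self-attention preserves the shape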

You should start to understand why Transformers are so powerful now: they exploit parallelism to the fullest.

Assembling the pieces of the Transformer

At a high level, a Transformer is the combination of three elements: an Encoder, a Decoder, and a Generator.

Encoder, Decoder, Generator, Image from article modified by author

1. The Encoder

Purpose: Convert the input sequence into a new sequence of the same length whose vectors capture the essence (the context) of the original data.

Note: If you’ve heard of the BERT model, it uses just this encoding part of the Transformer.

2. The Decoder

Purpose: Generate an output sequence using the encoded sequence from the Encoder.

Note: The decoder in the Transformer is different from the typical autoencoder’s decoder. In the Transformer, the decoder not only looks at the encoded output but also considers the tokens it has generated so far.

3. The Generator

Purpose: Convert a vector to a token. It does this by projecting the vector to the size of the vocabulary and then turning the result into a probability distribution over tokens with the softmax function, from which the most likely token can be picked.

Let’s code that:

from torch.nn.functional import log_softmax


class EncoderDecoder(nn.Module):
    """
    A standard Encoder-Decoder architecture. Base for this and many
    other models.
    """

    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.generator = generator

    def forward(self, src, tgt, src_mask, tgt_mask):
        "Take in and process masked src and target sequences."
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)


class Generator(nn.Module):
    "Define standard linear + softmax generation step."

    def __init__(self, d_model, vocab):
        super(Generator, self).__init__()
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x):
        return log_softmax(self.proj(x), dim=-1)

One remark here: “src” refers to the input sequence, and “tgt” refers to the target sequence being generated. Remember that we generate the output in an autoregressive manner, token by token, so we need to keep track of the target sequence as well.
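To make this autoregressive loop concrete, here is a minimal greedy-decoding sketch. It assumes a fully assembled and trained model (an EncoderDecoder instance) together with a source batch src and its src_mask; the causal mask is built inline with torch.tril, and the masking logic itself is explained later in this article.

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    "Minimal sketch: generate the target sequence one token at a time."
    memory = model.encode(src, src_mask)
    ys = torch.full((src.size(0), 1), start_symbol, dtype=src.dtype)
    for _ in range(max_len - 1):
        # Causal mask: position i may only attend to positions <= i
        tgt_mask = torch.tril(torch.ones(1, ys.size(1), ys.size(1))).bool()
        out = model.decode(memory, src_mask, ys, tgt_mask)
        log_probs = model.generator(out[:, -1])        # scores over the vocabulary for the last position
        next_token = log_probs.argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)        # append the predicted token and continue
    return ys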

Stacking Encoders

The Transformer’s Encoder isn’t just one layer. It’s actually a stack of N layers. Specifically:

The Encoder in the original Transformer model consists of a stack of N = 6 identical layers.

Inside each Encoder layer, there are two sublayer blocks, (1) and (2), which share the same structure: a residual connection followed by a layer norm.

Block (1) Self-Attention Mechanism: Helps the encoder focus on different words in the input when generating the encoded representation.

Block (2) Feed-Forward Neural Network: A small neural network applied independently to each position.

Encoder Layer, residual connections, and Layer Norm, Image from article modified by author

Now let’s code that:

SublayerConnection first:

We follow the general architecture: the “sublayer” argument can be either the self-attention block or the FFN.

class SublayerConnection(nn.Module):
    """
    A residual connection followed by a layer norm.
    Note for code simplicity the norm is first as opposed to last.
    """

    def __init__(self, size, dropout):
        super(SublayerConnection, self).__init__()
        self.norm = nn.LayerNorm(size)  # Use PyTorch's LayerNorm
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        "Apply residual connection to any sublayer with the same size."
        return x + self.dropout(sublayer(self.norm(x)))

Now we can define the full Encoder layer:

class EncoderLayer(nn.Module):
    "Encoder is made up of self-attn and feed forward (defined below)"

    def __init__(self, size, self_attn, feed_forward, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        # self attention, block 1
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # feed forward, block 2
        x = self.sublayer[1](x, self.feed_forward)
        return x

The Encoder Layer is ready, now let’s just chain them together to form the full Encoder:

import copy


def clones(module, N):
    "Produce N identical layers."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


class Encoder(nn.Module):
    "Core encoder is a stack of N layers"

    def __init__(self, layer, N):
        super(Encoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, mask):
        "Pass the input (and mask) through each layer in turn."
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)

Decoder

The Decoder, just like the Encoder, is structured with multiple identical layers stacked on top of each other. The number of these layers is also N = 6 in the original Transformer model.

How is the Decoder different from the Encoder?

A third SubLayer is added to interact with the encoder: this is Cross-Attention

SubLayer (1) is the same as in the Encoder: the self-attention mechanism, meaning that we generate everything (Q, K, V) from the tokens fed into the Decoder.

SubLayer (2) is the new communication mechanism: cross-attention. It is called that way because we use the output of (1) to generate the Queries, and we use the output of the Encoder to generate the Keys and Values (K, V). In other words, to generate a sentence we have to look both at what the Decoder has generated so far (self-attention), and at what we asked for in the first place, as encoded by the Encoder (cross-attention).

SubLayer (3) is identical to the one in the Encoder: the Feed-Forward Network.

Decoder Layer, self attention, cross attention, Image from article modified by author

Now let’s code the DecoderLayer. If you understood the mechanism in the EncoderLayer, this should be quite straightforward.

class DecoderLayer(nn.Module):
    "Decoder is made of self-attn, src-attn, and feed forward (defined below)"

    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, src_mask, tgt_mask):
        "Follow Figure 1 (right) for connections."
        m = memory
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, tgt_mask))
        # New sublayer (cross attention)
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)

And now we can chain the N=6 DecoderLayers to form the Decoder:

class Decoder(nn.Module):
    "Generic N layer decoder with masking."

    def __init__(self, layer, N):
        super(Decoder, self).__init__()
        self.layers = clones(layer, N)
        self.norm = nn.LayerNorm(layer.size)

    def forward(self, x, memory, src_mask, tgt_mask):
        for layer in self.layers:
            x = layer(x, memory, src_mask, tgt_mask)
        return self.norm(x)

At this point you have understood around 90% of what a Transformer is. There are still a few details:

Transformer Model Details

Padding:

In a typical transformer, there’s a maximum length for sequences (e.g., max_len = 5000). This defines the longest sequence the model can handle.

However, real-world sentences can vary in length. To handle shorter sentences, we use padding.

Padding is the addition of special “padding tokens” to make all sequences in a batch the same length.

Padding, image by author

Masking

Masking ensures that during the attention computation, certain tokens are ignored.

Two scenarios for masking:

src_masking: Since we’ve added padding tokens to sequences, we don’t want the model to pay attention to these meaningless tokens. Hence, we mask them out.

tgt_masking or Look-Ahead/Causal Masking: In the decoder, when generating tokens sequentially, each token should only be influenced by previous tokens and not future ones. For instance, when generating the 5th word in a sentence, it shouldn’t know about the 6th word. This ensures a sequential generation of tokens.

Causal Masking/Look-Ahead masking, image by author

We then use this mask to replace the corresponding scores with a very large negative number (−1e9 in the code, effectively minus infinity) so that the softmax gives these tokens a weight of essentially zero. This example should clarify things:

Masking, a trick in the softmax, image by author
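Here is a minimal sketch of how both masks can be built. The padding token id of 0 is an assumption for the example, and the subsequent_mask helper mirrors the one used in the Annotated Transformer:

def subsequent_mask(size):
    "Causal mask: position i can only attend to positions <= i."
    return torch.tril(torch.ones(1, size, size)).bool()

# Source mask: hide the positions that contain the padding token (assumed to be id 0 here)
src = torch.tensor([[5, 42, 7, 0, 0]])    # a padded source sequence
src_mask = (src != 0).unsqueeze(-2)       # shape (1, 1, 5)

# Target mask: hide padding AND future positions (padding + causal masking combined)
tgt = torch.tensor([[1, 8, 3, 0]])
tgt_mask = (tgt != 0).unsqueeze(-2) & subsequent_mask(tgt.size(-1))   # shape (1, 4, 4)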

FFN: Feed Forward Network

The “Feed Forward” layer in the Transformer’s diagram is a tad misleading. It’s not just one operation, but a sequence of them.

The FFN consists of two linear layers. Interestingly, the input data, which might be of dimension d_model = 512, is first transformed into a higher dimension d_ff = 2048 and then mapped back to its original dimension (d_model = 512).

This can be visualized as the data being “expanded” in the middle of the operation before being “compressed” back to its original size.

Image from article modified by author

This is easy to code:

class PositionwiseFeedForward(nn.Module):
    "Implements FFN equation."

    def __init__(self, d_model, d_ff, dropout=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.w_1 = nn.Linear(d_model, d_ff)
        self.w_2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.w_2(self.dropout(self.w_1(x).relu()))
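With every building block now defined, we can assemble the full model. The sketch below follows the spirit of the Annotated Transformer’s make_model helper, using the default hyperparameters of the original paper; the vocabulary sizes in the final line are toy values for illustration.

import copy


def make_model(src_vocab, tgt_vocab, N=6, d_model=512, d_ff=2048, h=8, dropout=0.1):
    "Assemble an EncoderDecoder from the components defined above."
    c = copy.deepcopy
    attn = MultiHeadedAttention(h, d_model)
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)
    position = PositionalEncoding(d_model, dropout)
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, src_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, tgt_vocab), c(position)),
        Generator(d_model, tgt_vocab),
    )
    # Glorot / Xavier initialization of the weight matrices
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model


model = make_model(src_vocab=1000, tgt_vocab=1000)  # toy vocabulary sizes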

Conclusion

The unparalleled success and popularity of the Transformer model can be attributed to several key factors:

Flexibility: Transformers can work with any sequence of vectors. These vectors can be embeddings for words. It is easy to transpose this to computer vision by splitting an image into patches and unfolding each patch into a vector. The same goes for audio, where we can split a signal into pieces and vectorize them.

Generality: With minimal inductive bias, the Transformer is free to capture intricate and nuanced patterns in data, thereby enabling it to learn and generalize better.

Speed & Efficiency: Leveraging the immense computational power of GPUs, Transformers are designed for parallel processing.

Thanks for reading! Before you go:

You can run the experiments with my Transformer Github Repository.

For more awesome tutorials, check my compilation of AI tutorials on Github

GitHub – FrancoisPorcher/awesome-ai-tutorials: The best collection of AI tutorials to make you a boss of Data Science!

You should get my articles in your inbox. Subscribe here.

If you want to have access to premium articles on Medium, you only need a membership for $5 a month. If you sign up with my link, you support me with a part of your fee without additional costs.

If you found this article insightful and beneficial, please consider following me and leaving a clap for more in-depth content! Your support helps me continue producing content that aids our collective understanding.

References

Attention Is All You Need

The Annotated Transformer (a good portion of the code is inspired from their blog post)

Andrej Karpathy’s Stanford lecture

To go further

Even with a comprehensive guide, there are many other areas linked with Transformers. Here are some ideas you may want to explore:

Positional Encoding: significant improvements have been made; you may want to check “Relative Positional Encoding” and “Rotary Positional Embedding (RoPE)”

Layer Norm, and the differences with Batch Norm and Group Norm

Residual connections, and their effect on smoothing the gradient

Improvements made to BERT (RoBERTa, ELECTRA, CamemBERT)

Distillation of large models into smaller models

Applications of Transformers in other domains (mainly vision and audio)

The link between Transformers and Graph Neural Networks
