Simplifying Transformers: State of the Art NLP Using Words You Understand, Part 5: Decoder and Final Output
The final part of the Transformer series
Image from the original paper.
This 5th part of the series builds heavily on the 2nd, 3rd, and 4th, so if you haven’t read through them and are not sure how the left side of the architecture in the image above works, I suggest you start there. Any term you find unclear and unexplained was probably covered in previous sections.
Decoder
The Transformer is an Encoder-Decoder architecture. Inputs go in and get encoded (transformed) into mathematical representations (numbers in some form, usually vectors). These are then passed to another processing unit called the decoder, where they are translated back from numbers into the requested output. In the case of a Language Model, that output is a word.
The decoder’s first important task is to process the target sequence (the answer we would like the model to give). It receives the entire target sequence, transforms it into embeddings, and adds positional encoding in the same way the encoder does. It then passes the embeddings through a Masked Multi-Head Attention layer, whose output will later serve as the Queries matrix. This matrix will assist the model in deciding how the user prompt and the expected target play together.
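As a minimal sketch of that preparation step, here is what it might look like with PyTorch building blocks; the sizes, the random stand-in for the positional encoding, and the token ids are made up for illustration, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, d_model, num_heads, seq_len = 1000, 512, 8, 10

embedding = nn.Embedding(vocab_size, d_model)
pos_encoding = torch.randn(1, seq_len, d_model)          # stand-in for sinusoidal positional encoding
masked_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

target_ids = torch.randint(0, vocab_size, (1, seq_len))  # the target sequence as token ids
x = embedding(target_ids) + pos_encoding                 # embeddings + positional encoding

# Causal mask: position i may only attend to positions <= i (True = blocked).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = masked_attn(x, x, x, attn_mask=causal_mask)     # masked self-attention output
```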
There is something of a paradox when explaining how Transformers work because you need to understand the pieces in order to understand how the end result happens, but you need to understand how the end result happens to understand the pieces. Taking that paradox into account we will take a short leap into the future and explain 2 things about how a Transformer trains:
First, in the encoder, the user prompt is converted into embeddings, and positional encoding is added to them. The stack of encoders (6 in the original paper) processes the data and generates a numerical representation of the text. Next, in the decoder, the ground truth (what we want the model to respond with) is prepended with a token indicating this is the first token of the sentence, something like <BOS> (beginning of sentence), though it could be any other symbol the model is trained on. This input is converted to embeddings and positional encodings are added. The stack of decoders (originally 6 as well) takes these vectors and, together with the encoder’s output, generates a new word representation. That representation is converted into a probability distribution, and the word (out of the model’s entire vocabulary) with the highest probability is chosen. Finally, a loss is calculated based on the gap between the model’s chosen word and the ground truth. That loss is used to generate the gradients needed for backpropagation (the algorithm that calculates how the weights should change according to their respective contribution to the overall error).
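To make that flow a bit more concrete, here is a rough sketch of a single training step. It leans on PyTorch’s built-in nn.Transformer as a stand-in for the encoder and decoder stacks; the sizes, the random tensors, and the Adam optimizer are assumptions for illustration, not the paper’s exact setup:

```python
import torch
import torch.nn as nn

# Stand-in encoder-decoder stack: 6 encoders, 6 decoders, as in the original paper.
model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6,
                       num_decoder_layers=6, batch_first=True)
to_vocab = nn.Linear(512, 1000)           # one score per word in a 1000-word vocabulary
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(to_vocab.parameters()))

src = torch.randn(1, 12, 512)             # user prompt: 12 tokens, already embedded + positioned
tgt_in = torch.randn(1, 10, 512)          # <BOS> + ground truth, already embedded + positioned
labels = torch.randint(0, 1000, (1, 10))  # the ground-truth token ids we want the model to produce

# The causal mask here is the "masking" explained later in this article.
tgt_mask = model.generate_square_subsequent_mask(10)
logits = to_vocab(model(src, tgt_in, tgt_mask=tgt_mask))      # shape (1, 10, 1000)

loss = loss_fn(logits.reshape(-1, 1000), labels.reshape(-1))  # gap between prediction and ground truth
optimizer.zero_grad()
loss.backward()                           # gradients for backpropagation
optimizer.step()                          # update the weights
```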
Now that we understand the general flow, let’s look at a small but important detail: a technique we use called Teacher Forcing.
Teacher Forcing
Think of a math test where you have 3 assignments:
1. Take the number 4, add 5, and keep the score.
2. Take the result from exercise 1, multiply by 2, and keep the score.
3. Take the result from exercise 2, and divide by 2.
You will be ranked on the result of each exercise separately, whether it’s right or wrong. Do you see the problem? If you’ve made a mistake in exercise 1, you’re going to have a mistake in exercises 2 and 3 as well. The Teacher Forcing technique handles that. As language models are also based on sequences (e.g. predicting the second word depends on the first word), in order to be correct later on, you have to have been correct earlier. In our example, Teacher Forcing would mean giving you the correct result of exercise 1 when you start exercise 2, and the correct result of exercise 2 when you start exercise 3, so you are actually tested on multiplying/dividing and not on addition. You will still be scored separately on each exercise; you simply won’t suffer in exercise 2 because of a mistake you made in exercise 1.
Specifically, in our case, Teacher Forcing helps us train faster. We give the model the true label at each step, so the prediction of the x-th word is always conditioned on correct history. For example, the 4th word will be predicted based on the 3 ground-truth (correct) previous labels and not on the model’s own predictions, which might be wrong and would keep the model from continuing correctly only because it made an earlier mistake.
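Here is a tiny, plain-Python sketch of what that looks like in practice; the token ids are made up. The decoder’s input is the ground truth shifted one step to the right (starting with <BOS>), and the labels it is scored against are the ground truth itself:

```python
# Teacher forcing sketch: the decoder always reads the *correct* previous tokens,
# never its own (possibly wrong) predictions. All token ids here are hypothetical.
BOS = 0
target = [17, 42, 7, 301, 55]            # made-up ids for "Amazing technology , I love"

decoder_input = [BOS] + target[:-1]      # what the decoder sees at each step
labels        = target                   # what it must predict at each step

for step, (seen, expected) in enumerate(zip(decoder_input, labels), start=1):
    print(f"step {step}: decoder reads id {seen}, must predict id {expected}")

# Even if the model mispredicts at step 1, step 2 is still conditioned on the
# ground-truth token, so every step is scored on its own merit.
```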
Okay, we’ve got the important parts for now. Back to the main road. After transforming the target into embeddings and adding positional encodings, we pass the input through a Masked Multi-Head Attention layer.
Masked Multi-Head Attention
We have previously seen what attention layers do and why they exist. The masked attention layer is pretty much the same thing, for the same reason, with one important difference. As the full expected output is processed by the decoder, it would be very simple for the model to use the entire sequence when building the attention scores. As these attention scores are a big part of the secret sauce, it’s important we get them right. Let’s presume the model is expected to produce the sentence: “Amazing technology, I love it”, and we are currently trying to predict the word “technology”. The word “technology” can have different representations in different contexts, but at inference time (when we actually use the model in real life), the model won’t have the entire sentence; it will only have the previous words. We therefore need to make sure that during training, too, the model only has access to the previous words and not to future words it won’t have access to in real life. This is done by Masking (hiding) future words.
As you might suspect, since ML is deeply nested in math, we don’t just delete the word; we have a fancier way of doing it. We turn math to passive-aggressive mode: we ignore it. Specifically, when calculating the attention scores we add minus infinity (in practice, a hugely negative number, e.g. -86131676513135468) to the future positions, which makes softmax at the next stage turn those scores into 0. This important technique makes sure the model won’t be able to use a word it shouldn’t yet have access to.
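A minimal sketch of that trick with made-up attention scores: everything above the diagonal (the future words) gets minus infinity, and softmax then turns those positions into exactly zero.

```python
import torch

# Raw attention scores (rows = query positions, columns = key positions). Made-up numbers.
scores = torch.tensor([[2.0, 1.0, 3.0],
                       [0.5, 2.5, 1.0],
                       [1.0, 0.2, 2.0]])

# True above the diagonal = "this is a future word, hide it".
mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)
# Row 0 attends only to position 0, row 1 to positions 0-1, row 2 to all three.
```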
Image by Author
After calculating the masked attention scores, the input goes through an Add & Normalize layer, in the same way and for the same reason we previously explained; it also receives the skip connection from the layer before the attention calculation. The result then continues with us to the next stage as the Queries matrix. We now take the K(eys) and V(alues) matrices from the encoder output, which represent the user prompt, while the decoder brings its own Queries matrix to decide which parts of the encoder’s input to focus on. With these 3 matrices (2 from the encoder, one from the decoder) we calculate a “regular” attention score.
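A small sketch of this cross-attention step with PyTorch’s nn.MultiheadAttention and hypothetical shapes: the Queries come from the decoder, the Keys and Values are built from the encoder’s output.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

encoder_output = torch.randn(1, 12, d_model)   # the encoded user prompt (12 tokens)
decoder_state  = torch.randn(1, 10, d_model)   # output of the masked attention + Add & Norm

# query = decoder, key = value = encoder: the decoder decides which prompt tokens to focus on.
out, attn_weights = cross_attn(query=decoder_state,
                               key=encoder_output,
                               value=encoder_output)
print(out.shape)           # torch.Size([1, 10, 512])
print(attn_weights.shape)  # torch.Size([1, 10, 12]): for each target token, a weight per prompt token
```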
Next, we have another Feed-Forward + Add & Norm layer which receives another skip connection, exactly like we’ve previously explained, and … we’re done with the decoder!
We now arrive at the final step. The final (6th) decoder in the stack of decoders passes its output through a linear layer. The linear layer lets us produce as many numbers as we want. In a language model, we want the number of outputs to match the size of the vocabulary: if the model’s entire vocabulary (all the unique words it has seen) contains 1000 words, we want 1000 numbers, one score per possible word. We do this for every position in the output: if the final output contains 10 words, we calculate 1000 numbers for each of those 10 words. We then pass these scores to a Softmax layer, which turns them into a probability for every word, and we pick the word with the highest probability. That gives us an index, say 3; the model looks up index 3 in the vocabulary and that is the new predicted word. If our vocabulary is [‘a’, ‘man’, ‘works’, ‘nice’, ‘go’], the chosen word will be ‘nice’.
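Here is a minimal sketch of that last step, using the toy 5-word vocabulary from above; since the decoder output here is random, the picked word will vary from run to run:

```python
import torch
import torch.nn as nn

vocab = ['a', 'man', 'works', 'nice', 'go']   # toy 5-word vocabulary
d_model = 512

to_vocab = nn.Linear(d_model, len(vocab))     # decoder dimension -> one score per vocabulary word
decoder_output = torch.randn(1, d_model)      # representation of a single output position

logits = to_vocab(decoder_output)             # 5 raw scores
probs = torch.softmax(logits, dim=-1)         # 5 probabilities that sum to 1
index = probs.argmax(dim=-1).item()           # index of the highest probability, e.g. 3
print(vocab[index])                           # with index 3 that would be 'nice'
```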
And that is … all. We’re done! You now have a good understanding of how the Transformer architecture works, end to end. It’s been quite a journey, and you went through it bravely. I hope the series helps you understand this complicated yet understandable and important subject.
Have a good time.