Large Language Models: TinyBERT — Distilling BERT for NLP

Unlocking the power of Transformer distillation in LLMs


In recent years, large language models have evolved at a remarkable pace. BERT became one of the most popular and efficient models, solving a wide range of NLP tasks with high accuracy. After BERT, a number of other models appeared that demonstrated outstanding results as well.

The obvious trend is that large language models (LLMs) grow more complex over time, with exponential increases in the number of parameters and the amount of data they are trained on. Research in deep learning has shown that such scaling usually leads to better results. Unfortunately, it also comes at a cost: scalability has become the main obstacle to training, storing and using LLMs effectively.

To address this issue, special methods have been developed for compressing LLMs. In this article, we will focus on Transformer distillation, which led to the development of a small version of BERT called TinyBERT. We will also walk through the learning process in TinyBERT and several subtleties that make it so robust. This article is based on the official TinyBERT paper.

Main idea

We have previously covered how distillation works in DistilBERT: in short, the loss function is modified to make the predictions of the student and teacher similar. In DistilBERT, the loss function compares the output distributions of the student and teacher and also takes into consideration the output embeddings of both models (for the similarity loss).

Large Language Models: DistilBERT — Smaller, Faster, Cheaper and Lighter

On the surface, the distillation framework in TinyBERT does not differ much from the one in DistilBERT: the loss function is again modified to make the student imitate the teacher. However, TinyBERT goes a step further: its loss function takes into consideration not only WHAT both models produce but also HOW the predictions are obtained. According to the paper, the TinyBERT loss function consists of three components that cover different aspects of both models:

1. the output of the embedding layer

2. the hidden states and attention matrices derived from the Transformer layer

3. the logits output by the prediction layer

Transformer distillation losses

What is the point of comparing the hidden states of both models? Including the hidden states and attention matrices makes it possible for the student to learn the hidden layers of the teacher, and thus to construct layers similar to those of the teacher. This way, the distilled model imitates not only the output of the original model but also its inner behaviour.

Why is it important to replicate the teacher’s behaviour? The researchers claim that the attention weights learned by BERT can be beneficial for capturing language structure. Distilling them into another model therefore gives the student a better chance of gaining linguistic knowledge.

Layer mapping

As a smaller version of BERT, TinyBERT has fewer encoder layers. Let N be the number of BERT layers and M the number of TinyBERT layers. Since the two numbers differ, it is not obvious how to calculate the distillation loss.

For this purpose, a special function n = g(m) is introduced to define which BERT layer n distills its knowledge into the corresponding TinyBERT layer m. The chosen BERT layers are then used for loss calculation during training.

The function n = g(m) satisfies two reasonable constraints:

1. g(0) = 0. This means that the embedding layer in BERT is mapped directly to the embedding layer in TinyBERT, which makes sense.

2. g(M + 1) = N + 1. This equation indicates that the prediction layer in BERT is mapped to the prediction layer in TinyBERT.

For all other TinyBERT layers 1 ≤ m ≤ M, the values of g(m) still need to be specified. For now, let us suppose that such a function is defined. The TinyBERT settings will be studied later in this article.

Transformer distillation

1. Embedding-layer distillation

Before raw input is passed to the model, it is first tokenized and then mapped to learned embeddings. These embeddings are then used as the first layer of the model. All the embeddings can be expressed in the form of a matrix. To measure how different the student and teacher embeddings are, a standard regression metric can be applied to their respective embedding matrices E. Transformer distillation uses MSE for this purpose.

Since the student and teacher embedding matrices have different sizes, they cannot be compared element-wise with MSE directly. That is why the student embedding matrix is multiplied by a learnable weight matrix W, so that the resulting matrix has the same shape as the teacher embedding matrix.

Embedding-layer distillation loss

Since the embedding spaces of the student and teacher are different, the matrix W also plays an important role in linearly transforming the embedding space of the student into that of the teacher.
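To make the role of W concrete, here is a minimal sketch of the embedding-layer loss in plain Python. The helper names (`matmul`, `mse`, `embedding_distillation_loss`) and the toy matrices are our own illustrative assumptions, not code from the paper; in a real setup W would be a trainable parameter and the matrices would be tensors.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mse(a, b):
    """Mean squared error between two matrices of identical shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def embedding_distillation_loss(e_student, e_teacher, w):
    """MSE between the projected student embeddings (E_S W) and E_T."""
    return mse(matmul(e_student, w), e_teacher)


# Toy shapes (assumed for illustration): 2 tokens, student width 2, teacher width 3.
e_student = [[1.0, 0.0], [0.0, 1.0]]
e_teacher = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # projects width 2 -> width 3

loss = embedding_distillation_loss(e_student, e_teacher, w)
```

In this toy case the projection maps the student embeddings exactly onto the teacher embeddings, so the loss is zero. With any other W the loss is positive, and training would push W and the student embeddings towards a better match.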

2. Transformer-layer distillation

Transformer-layer distillation loss visualisation

2A. Attention-layer distillation

At its core, the multi-head attention mechanism in the Transformer produces several attention matrices containing rich linguistic knowledge. By transferring the attention weights from the teacher, the student can also learn important language concepts. To implement this idea, the loss function measures the differences between the student and teacher attention weights.

In TinyBERT, all the attention layers are considered, and the resulting loss value for each layer equals the sum of the MSE values between the respective student and teacher attention matrices over all heads.

Attention-layer distillation loss

The attention matrices A used for attention-layer distillation are the unnormalized scores rather than their softmax output softmax(A). According to the researchers, this subtlety leads to faster convergence and improved performance.
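The per-layer attention loss can be sketched as follows. This is a simplified illustration: the function names and toy values are assumptions of ours, and the sketch averages the per-head MSE values (the form we believe the paper uses), which differs from a plain sum only by the constant factor equal to the number of heads.

```python
def mse(a, b):
    """Mean squared error between two matrices of identical shape."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def attention_distillation_loss(student_heads, teacher_heads):
    """Average per-head MSE between unnormalized attention score matrices."""
    if len(student_heads) != len(teacher_heads):
        raise ValueError("student and teacher must have the same number of heads")
    per_head = [mse(s, t) for s, t in zip(student_heads, teacher_heads)]
    return sum(per_head) / len(per_head)


# Toy example (assumed values): 2 heads, sequence length 2.
# Entries are unnormalized attention scores, so rows need not sum to 1.
student_heads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
teacher_heads = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]

loss = attention_distillation_loss(student_heads, teacher_heads)
```

Here the first head matches the teacher perfectly and the second head is off by 2 in every entry, so the per-head MSE values are 0 and 4 and their average is 2.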

2B. Hidden-layer distillation

Following the idea of capturing rich linguistic knowledge, the distillation is applied to the outputs of transformer layers as well.

Hidden-layer distillation loss

The weight matrix W plays the same role as the one described above for embedding-layer distillation.

3. Prediction-layer distillation

Finally, to make the student reproduce the output of the teacher, the prediction-layer loss is considered. It consists of computing the cross-entropy between the output distributions obtained from the logit vectors of both models.

Prediction-layer distillation loss

Sometimes, the logits are divided by a temperature parameter T, which controls the smoothness of the output distribution. In TinyBERT, the temperature T is set to 1.
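As a rough sketch, the prediction-layer loss can be written as a soft cross-entropy: the teacher logits are turned into probabilities, the student logits into log-probabilities, and both are scaled by the temperature first. The function names and example logits below are illustrative assumptions rather than the authors' code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def prediction_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy between teacher probabilities and student log-probabilities."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


# Toy logits (assumed values) for a 3-class task.
teacher_logits = [2.0, 0.5, -1.0]
matched = prediction_distillation_loss(teacher_logits, teacher_logits)
mismatched = prediction_distillation_loss([-1.0, 0.5, 2.0], teacher_logits)
```

The loss is smallest when the student reproduces the teacher distribution exactly (`matched`), and it grows as the student disagrees with the teacher (`mismatched`). The default `temperature=1.0` mirrors the TinyBERT setting mentioned above.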

Loss equation

In TinyBERT, each layer has its own loss function depending on its type. To give some layers more or less importance, the corresponding loss values are multiplied by a constant a. The final loss function equals the weighted sum of the loss values over all TinyBERT layers.

Loss function in TinyBERT

In numerous experiments, it was shown that among the three loss components, the transformer-layer distillation loss has the highest impact on the model’s performance.
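The weighted sum can be sketched in a few lines. The numbers below are made up for illustration, and we assume (following our reading of the paper) that the loss of each Transformer layer is the sum of its hidden-state and attention losses; the function name `total_distillation_loss` is hypothetical.

```python
def total_distillation_loss(layer_losses, weights):
    """Weighted sum of per-layer losses, indexed m = 0 .. M + 1."""
    if len(layer_losses) != len(weights):
        raise ValueError("need exactly one weight per layer loss")
    return sum(w * loss for w, loss in zip(weights, layer_losses))


# Assumed toy values for a student with M = 4 Transformer layers.
embedding_loss = 0.20                          # m = 0
transformer_losses = [0.50, 0.40, 0.30, 0.25]  # m = 1..4, each hidden + attention
prediction_loss = 0.10                         # m = M + 1

layer_losses = [embedding_loss] + transformer_losses + [prediction_loss]
weights = [1.0] * len(layer_losses)            # equal importance for every layer

total = total_distillation_loss(layer_losses, weights)
```

Setting a weight to zero switches a component off entirely, which is one simple way to run ablations such as the one mentioned above.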


It is important to note that most NLP models (including BERT) are developed in two stages:

1. The model is pretrained on a large corpus of data to gain general knowledge of the language structure.

2. The model is fine-tuned on another dataset to solve a specific downstream task.

Following the same paradigm, the researchers developed a framework in which the TinyBERT learning process also consists of two stages. In both training stages, Transformer distillation is used to transfer BERT knowledge to TinyBERT.

1. General distillation. TinyBERT gains rich general knowledge about the language structure from pre-trained BERT (without fine-tuning) acting as a teacher. Because it uses fewer layers and parameters, TinyBERT generally performs worse than BERT after this stage.

2. Task-specific distillation. This time, the fine-tuned version of BERT plays the role of the teacher. To further improve performance, the researchers propose applying data augmentation to the training dataset. Results show that after task-specific distillation, TinyBERT achieves performance comparable to that of BERT.

Training process

Data augmentation

A special data augmentation technique was developed for task-specific distillation. It consists of taking sequences from a given dataset and substituting a percentage of words in one of two ways:

1. If the word is tokenized as a single piece (that is, it is not split into subwords), it is masked, and the word predicted by the BERT model replaces the original word in the sequence.

2. If the word is tokenized into several subwords, it is replaced by one of the words whose GloVe embeddings are most similar to it.

Although the student model is considerably smaller, this data augmentation mechanism has a high impact on TinyBERT performance by exposing it to more diverse examples.
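The two replacement rules can be sketched as a single loop over the words of a sequence. This is a simplified illustration, not the algorithm from the paper: `is_single_piece`, `predict_with_bert` and `nearest_by_glove` are hypothetical callables that stand in for a real tokenizer, a masked language model and a GloVe nearest-neighbour lookup, and `replace_prob` is an arbitrary illustrative value.

```python
import random


def augment(words, is_single_piece, predict_with_bert, nearest_by_glove,
            replace_prob=0.4, rng=None):
    """Return a copy of `words` in which some words are substituted."""
    rng = rng or random.Random()
    augmented = list(words)
    for i, word in enumerate(words):
        if is_single_piece(word):
            # Single-piece word: ask the language model for candidates.
            candidates = predict_with_bert(words, i)
        else:
            # Multi-piece word: fall back to embedding neighbours.
            candidates = nearest_by_glove(word)
        if candidates and rng.random() < replace_prob:
            augmented[i] = rng.choice(candidates)
    return augmented


# Hypothetical stubs so that the sketch runs without any model.
def is_single_piece(word):
    return len(word) <= 5  # toy rule standing in for a WordPiece tokenizer


def predict_with_bert(words, index):
    return {"movie": ["film"]}.get(words[index], [])


def nearest_by_glove(word):
    return {"unbelievable": ["fantastic"]}.get(word, [])


sentence = ["this", "movie", "is", "unbelievable"]
new_sentence = augment(sentence, is_single_piece, predict_with_bert,
                       nearest_by_glove, replace_prob=1.0, rng=random.Random(0))
```

With `replace_prob=1.0` every word that has at least one candidate is replaced, so the stubs above turn the sentence into "this film is fantastic". Lower probabilities produce a mix of original and substituted words, and calling the function several times per sequence would yield several augmented variants.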

Augmentation example

Model settings

With only 14.5M parameters, TinyBERT is about 7.5x smaller than BERT base. A detailed comparison is shown in the figure below:

BERT base vs TinyBERT comparison

For the layer mapping, the authors propose a uniform strategy in which the layer mapping function maps each TinyBERT layer to every third BERT layer: g(m) = 3 * m. Other strategies were also studied (such as taking only the bottom or only the top BERT layers), but the uniform strategy showed the best results. This seems logical: it transfers knowledge from different levels of abstraction, which makes the transferred information more varied.

Different layer mapping strategies. Performance results are shown for the GLUE dataset.
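The uniform mapping described above, together with the two boundary constraints g(0) = 0 and g(M + 1) = N + 1, can be written down directly. The sketch below assumes M = 4 student layers and N = 12 teacher layers, which is consistent with g(m) = 3 * m; the function name `uniform_layer_map` is our own.

```python
def uniform_layer_map(m, num_student_layers, num_teacher_layers):
    """Return the teacher layer index g(m) for student layer m."""
    if not 0 <= m <= num_student_layers + 1:
        raise ValueError("m must be between 0 and M + 1")
    if num_teacher_layers % num_student_layers != 0:
        raise ValueError("N must be a multiple of M for a uniform mapping")
    if m == num_student_layers + 1:
        # Prediction layer maps to prediction layer.
        return num_teacher_layers + 1
    # m = 0 gives 0 (embedding layer); otherwise every (N / M)-th teacher layer.
    return m * (num_teacher_layers // num_student_layers)


M, N = 4, 12  # assumed sizes: 4-layer student, 12-layer teacher
mapping = {m: uniform_layer_map(m, M, N) for m in range(M + 2)}
```

For these sizes the student layers 1 to 4 are paired with teacher layers 3, 6, 9 and 12, while the embedding and prediction layers are paired with their counterparts.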

As for the training process, TinyBERT is trained on English Wikipedia (2500M words) and shares most of its hyperparameters with BERT base.


Transformer distillation is a big step in natural language processing. Given that Transformer-based models are among the most powerful in machine learning at the moment, Transformer distillation lets us get even more out of them by compressing them effectively. One of the greatest examples is TinyBERT, which is 7.5x smaller than BERT base.

Despite such a huge reduction in parameters, experiments show that TinyBERT demonstrates performance comparable to that of BERT base: with a score of 77.0% on the GLUE benchmark, TinyBERT is not far behind BERT, whose score equals 79.5%. This is an impressive achievement! Finally, other popular compression techniques like quantization or pruning can be applied to TinyBERT to make it even smaller.


TinyBERT: Distilling BERT for Natural Language Understanding

All images unless otherwise noted are by the author

This article was originally published in Towards Data Science on Medium.