
How to Train BERT for Masked Language Modeling Tasks

A hands-on guide to building a language model for masked language modeling (MLM) tasks from scratch using Python and the Hugging Face Transformers library


In recent years, large language models (LLMs) have captured most of the attention in the machine learning community. Before LLMs arrived, we had a crucial research phase on various language modeling techniques, including masked language modeling, causal language modeling, and sequence-to-sequence language modeling.

Among these, masked language models such as BERT proved especially useful in downstream NLP tasks such as classification and clustering. Thanks to libraries such as Hugging Face Transformers, adapting these models for downstream tasks became more accessible and manageable. And thanks to the open-source community, we have plenty of pre-trained language models to choose from, covering widely used languages and domains.

Fine-tune or build one from scratch?

When adapting existing language models to your specific use case, you can sometimes use an existing model without further tuning. For example, if you need an English sentiment/intent detection model, you can browse HuggingFace.co and find a suitable model for your use case.

However, only some of the tasks encountered in the real world are covered this way. That's where an additional technique called fine-tuning comes in. First, you must choose a base model to fine-tune. Here, you must pay attention to the lexical similarity between the selected model and your target language.

However, if you can't find a suitable model pre-trained on the desired language, consider building one from scratch. In this tutorial, we will implement a BERT model for masked language modeling.

BERT Architecture

Even though describing the BERT architecture is out of the scope of this tutorial, for the sake of clarity, let's go through it briefly. BERT, or Bidirectional Encoder Representations from Transformers, belongs to the encoder-only transformer family. It was introduced in 2018 by researchers at Google.

Paper abstract:

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
Paper: https://arxiv.org/abs/1810.04805

In the abstract above, we can see an interesting keyword: Bidirectional. This bidirectional nature gives BERT a human-like capability. Assume you have to fill in a blank like the one below:

“War may sometimes be a necessary evil. But no matter how necessary, it is always an _____, never a good.”

To guess the word that fills the blank, you need a few things: the words before the blank, the words after it, and the overall context of the sentence. Mirroring this human behavior, BERT works the same way. During training, we hide some words and ask BERT to predict them. Once training is finished, BERT can predict masked tokens based on the words before and after them. To do this, the model must allocate different amounts of attention to the words in the input sequence, since some of them significantly impact the prediction of masked tokens.

Image by Author via https://huggingface.co/spaces/exbert-project/exbert

As you can see here, the model considers evil a suitable word for the hidden position and treats the first sentence's evil as essential context for making this prediction. This is a noticeable point and implies that the model understands the context of the input sequence. This context awareness allows BERT to generate meaningful sentence embeddings for given tasks. Further, these embeddings can be used in downstream tasks such as clustering and classification. Enough about BERT; let's build one from scratch.

Defining the BERT model

We generally have BERT(base) and BERT(large). Both have 64 dimensions per head. The large variant contains 24 encoder layers, while the base variant only has 12. However, we are not limited to these configurations. We have complete control over defining the model using the Hugging Face Transformers library. All we have to do is define the desired model configuration using the BertConfig class.
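To see how these numbers relate, note that the per-head dimensionality is simply hidden_size divided by num_attention_heads. A quick arithmetic sketch (the helper function name is mine, purely illustrative):

```python
# Per-head dimensionality = hidden_size / num_attention_heads.
# hidden_size must divide evenly by the number of heads.
def head_dim(hidden_size: int, num_attention_heads: int) -> int:
    assert hidden_size % num_attention_heads == 0
    return hidden_size // num_attention_heads

print(head_dim(768, 12))   # BERT-base  -> 64
print(head_dim(1024, 16))  # BERT-large -> 64
print(head_dim(384, 6))    # our small config below -> 64
```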

I chose 6 heads and a total model dimension of 384 so that each head gets 64 dimensions, just like the original implementation. Let's initialize our BERT model.

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    hidden_size=384,
    vocab_size=tokenizer.vocab_size,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
    max_position_embeddings=256,
)

model = BertForMaskedLM(config=config)
print(model.num_parameters())  # 10457864

Training a tokenizer

Here, I won't describe how tokenization works under the hood. Instead, let's train one from scratch using the Hugging Face Tokenizers library. Please note that the tokenizer used in the original BERT implementation is the WordPiece tokenizer, a subword-based tokenization method. You can learn more about this tokenization approach in the neat Hugging Face resource below.

WordPiece tokenization – Hugging Face NLP Course
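To give a flavor of what WordPiece does, here is a toy sketch of its greedy longest-match-first splitting. The function and the tiny vocabulary are illustrative only; a real tokenizer learns its vocabulary from the corpus:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first subword split, as WordPiece does.
    # Continuation pieces are prefixed with "##".
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no valid split found
        tokens.append(piece)
        start = end
    return tokens

vocab = {"cricket", "##er", "##ers"}
print(wordpiece_tokenize("cricketer", vocab))  # ['cricket', '##er']
```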

The dataset used here is the Sinhala-400M dataset (under apache-2.0). You can follow the same with any dataset you have.

As you may notice, some Sinhala words in the corpus are typed in English characters as well. Let's train a tokenizer for this corpus.

Let's import the necessary modules first. The good thing about training tokenizers with the Hugging Face Tokenizers library is that we can reuse an existing tokenizer and replace only its vocabulary (and merges, where applicable) based on our training corpus. This means tokenization steps such as pre-tokenization and post-tokenization are preserved. For this, we can use the train_new_from_iterator method of the BertTokenizer class.

from transformers import AutoTokenizer
from datasets import Dataset
import pandas as pd

# load the base tokenizer whose pipeline we will reuse
tokenizer_base = AutoTokenizer.from_pretrained("bert-base-cased")

# convert the pandas DataFrame (with a 'comment' text column) to an HF dataset
dataset = Dataset.from_pandas(df.rename(columns={"comment": "text"}))

# define an iterator that yields batches of 1,000 texts
training_corpus = (
    dataset[i : i + 1000]["text"]
    for i in range(0, len(dataset), 1000)
)

# train the new tokenizer on the dataset
tokenizer = tokenizer_base.train_new_from_iterator(training_corpus, 5000)

# test the trained tokenizer on a sample text
text = dataset["text"][123]
print(text)

# inspect the tokenization process
input_ids = tokenizer(text).input_ids
subword_view = [tokenizer.convert_ids_to_tokens(id) for id in input_ids]
print(subword_view)

You can see that words like 'cricketer' decompose into cricket and ##er, indicating that the tokenizer has been trained adequately. That said, do experiment with different vocabulary sizes; mine is 5,000, which is relatively small but suitable for this toy example.

Finally, we can save the trained tokenizer into a directory of our choice so it can be reloaded later:

tokenizer.save_pretrained("tokenizer")

Defining the data collator and tokenizing the dataset

Let's define a data collator for MLM tasks. Here, we will mask 15% of the tokens, though we could choose a different masking probability.

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
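Under the hood, the collator follows BERT's masking recipe: of the selected tokens, 80% are replaced with [MASK], 10% with a random vocabulary token, and 10% are left unchanged, while the original token is kept as the label for every selected position. A pure-Python sketch of the idea (the function and its names are illustrative, not the library's implementation):

```python
import random

def mlm_mask(tokens, mask_prob=0.15, vocab=None, seed=0):
    # Each token is selected with probability mask_prob; of the selected
    # tokens, 80% become [MASK], 10% become a random vocabulary token,
    # and 10% stay unchanged.
    rng = random.Random(seed)
    vocab = vocab or tokens
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append("[MASK]")
            elif roll < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)
        else:
            labels.append(None)  # unselected positions are ignored in the loss
            masked.append(tok)
    return masked, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 100
masked, labels = mlm_mask(tokens)
print(sum(label is not None for label in labels))  # roughly 15% of 600
```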

Let's tokenize the dataset using the previously created tokenizer. I'm replacing the original LineByLineTextDataset with my own custom class that utilizes Hugging Face Accelerate.

import torch
from torch.utils.data import Dataset
from accelerate import Accelerator

class LineByLineTextDataset(Dataset):
    def __init__(self, tokenizer, raw_datasets, max_length: int):
        self.padding = "max_length"
        self.text_column_name = "text"
        self.max_length = max_length
        self.accelerator = Accelerator(gradient_accumulation_steps=1)
        self.tokenizer = tokenizer

        with self.accelerator.main_process_first():
            self.tokenized_datasets = raw_datasets.map(
                self.tokenize_function,
                batched=True,
                remove_columns=[self.text_column_name],
                desc="Running tokenizer on dataset line_by_line",
            )
            self.tokenized_datasets.set_format("torch")

    def tokenize_function(self, examples):
        # drop empty and whitespace-only lines
        examples[self.text_column_name] = [
            line
            for line in examples[self.text_column_name]
            if len(line) > 0 and not line.isspace()
        ]
        return self.tokenizer(
            examples[self.text_column_name],
            padding=self.padding,
            truncation=True,
            max_length=self.max_length,
            return_special_tokens_mask=True,
        )

    def __len__(self):
        return len(self.tokenized_datasets)

    def __getitem__(self, i):
        return self.tokenized_datasets[i]

Let’s tokenize the dataset.

tokenized_dataset_train = LineByLineTextDataset(
    tokenizer=tokenizer,
    raw_datasets=dataset,
    max_length=256,  # match the model's max_position_embeddings
)

Alright, let’s code our training loop.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-mlm",  # any output directory you like
    logging_steps=1000,
    use_mps_device=True,  # disable this if you're running in a non-Mac env
    hub_private_repo=False,  # set True if you want to save the model privately
    save_safetensors=True,
    learning_rate=1e-4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset_train,
)

We can invoke the trainer using its train() method.

trainer.train()
After sufficient training, our model can be used for downstream tasks such as zero-shot classification and clustering. You can explore an example in the Hugging Face Space below.

Sinhala Embedding Space – a Hugging Face Space by Ransaka
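As an illustration of how such embeddings are typically used, sentence embeddings are often obtained by mean-pooling the encoder's token vectors and then compared with cosine similarity. A minimal sketch with toy vectors (in real usage, the token vectors would come from the model's last hidden states):

```python
import math

def mean_pool(token_vecs):
    # average the token vectors into a single sentence embedding
    dim = len(token_vecs[0])
    n = len(token_vecs)
    return [sum(vec[i] for vec in token_vecs) / n for i in range(dim)]

def cosine(a, b):
    # cosine similarity between two embeddings
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# toy 3-dimensional "token vectors" for two sentences
s1 = mean_pool([[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]])
s2 = mean_pool([[1.0, 1.0, 1.0]])
print(cosine(s1, s2))
```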


With limited resources, a model pre-trained this way may only capture specific linguistic patterns, but it can still be helpful for particular use cases. Fine-tuning on task-specific data is highly recommended when possible.

In this article, all images, unless otherwise noted, are by the author.


An Explorable BERT — https://huggingface.co/spaces/exbert-project/exbert
BERT Paper — https://arxiv.org/abs/1810.04805
Dataset — https://huggingface.co/datasets/Ransaka/Sinhala-400M

How to Train BERT for Masked Language Modeling Tasks was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
