Build a Language Model on your WhatsApp Chats

A visual guide through the GPT architecture with an application

Photo by Volodymyr Hryshchenko on Unsplash

Chatbots have undeniably transformed our interaction with digital platforms. Despite the impressive advancements in the capabilities of underlying language models to handle complex tasks, the user experience often falls short, feeling impersonal and detached.

To make conversations more relatable, I envisioned a chatbot that could emulate my casual writing style, akin to texting a friend over WhatsApp.

In this article, I’ll walk you through my journey of building a (small) language model that generates synthetic conversations, using my WhatsApp chat messages as input data. Along the way, I try to unravel the inner workings of the GPT architecture in a visual and hopefully digestible way, complemented by the actual Python implementation. You can find the full project on my GitHub.

Note: The model class itself is largely taken from Andrej Karpathy’s video series and adapted to my needs. I can highly recommend his tutorials.

GitHub – bernhard-pfann/lad-gpt

Table of Contents

1. Selected Approach
2. Data Source
3. Tokenization
4. Indexing
5. Model Architecture
6. Model Training
7. Chat-Mode

1. Selected Approach

When it comes to tailoring a language model to a specific corpus of data, there are several approaches one can take:

- Model building: This involves constructing and training a model from scratch, providing the utmost flexibility in terms of model architecture and training data selection.
- Fine-tuning: This approach leverages an existing pre-trained model, adjusting its weights to align more closely with the specific data at hand.
- Prompt engineering: This also utilizes an existing pre-trained model, but here, the unique corpus is directly incorporated into the prompt, without changing the model’s weights.

As my motivation for this project is mainly self-educational, and I am rather interested in the architecture of modern language models, I opted for the first approach. Yet, this choice comes with obvious limitations. Given the size of my data and available computational resources, I didn’t anticipate results on par with any state-of-the-art pre-trained models.

Nevertheless, I was hopeful that my model would pick up some interesting linguistic patterns, which it ultimately did. Exploring the second option (fine-tuning) might be the focus of a future article.

2. Data Source

WhatsApp, my primary communication channel, was an ideal source for capturing my conversational style. Exporting over six years of group chat history, totaling more than 1.5 million words, was straightforward.

The data was parsed using a regex pattern into a list of tuples containing the date, contact name, and chat message.

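import re

# each exported line has the form "[date] name: message"
pattern = r'\[(.*?)\] (.*?): (.*)'
matches = re.findall(pattern, text)

# keep the contact name and the lower-cased message
text = [(x1, x2.lower()) for x0, x1, x2 in matches]

The parsed matches look as follows:

[
(2018-03-12 16:03:59, "Alice", "Hi, how are you guys?"),
(2018-03-12 16:05:36, "Tom", "I am good thanks!"),
...
]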


Now, each element was processed separately.

- Sending date: Aside from converting it into a datetime object, I have not utilized this information. However, one could look at the time deltas to differentiate the start and end of topic discussions.
- Contact name: When tokenizing the text, each contact name is treated as a unique token. This ensures that the combination of first and last names is still considered a single entity.
- Chat message: A special “<END>” token is added at the end of each message.
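As a rough sketch of the last two points, the parsed tuples could be joined into a single corpus string in which every message is closed by the end token. The helper name `build_corpus` and the exact layout of the string are assumptions for illustration, not taken from the project.

```python
# Hypothetical helper, not part of the original project.
END_TOKEN = "<END>"

def build_corpus(messages):
    """Join (name, message) tuples into one string, closing each message with the end token."""
    parts = []
    for name, message in messages:
        # "name:" stays one unit, so the tokenizer can later treat it as a special token
        parts.append(f"{name}: {message} {END_TOKEN}")
    return " ".join(parts)

messages = [("Alice", "hi, how are you guys?"), ("Tom", "i am good thanks!")]
corpus = build_corpus(messages)
print(corpus)
```

The resulting string is what a tokenizer would then split into contact tokens, words, and end tokens.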

3. Tokenization

To train a language model, we need to break language into pieces (so-called tokens) and feed them to the model incrementally. Tokenization can be performed on multiple levels.

- Character-level: Text is perceived as a sequence of individual characters (including white spaces). This granular approach allows every possible word to be formed from a sequence of characters. However, it is more difficult to capture semantic relationships between words.
- Word-level: Text is represented as a sequence of words. However, the model’s vocabulary is limited by the existing words in the training data.
- Sub-word-level: Text is broken down into sub-word units, which are smaller than words but larger than characters.
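To make the difference between the first two levels tangible, here is a small sketch using only the standard library. The regular expression is an assumption chosen for this illustration; sub-word tokenization is left out because it requires a trained vocabulary.

```python
import re

sentence = "hi, how are you guys?"

# character level: every character (including spaces) becomes a token
char_tokens = list(sentence)

# word level: runs of word characters, or single punctuation symbols
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(len(char_tokens), char_tokens[:5])
print(len(word_tokens), word_tokens)
```

The same sentence turns into 21 character tokens but only 7 word tokens, so a fixed context window covers far more text on the word level.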

While I started off with a character-level tokenizer, I felt that training time was wasted on learning the character sequences of recurring words, rather than on the semantic relationships between words across the sentence.

For the sake of conceptual simplicity, I decided to switch to a word-level tokenizer, keeping aside the available libraries for more sophisticated tokenization strategies.

from typing import List

from nltk.tokenize import RegexpTokenizer

def custom_tokenizer(txt: str, spec_tokens: List[str], pattern: str = r"|\d|\w+|[^\s]") -> List[str]:
    """
    Tokenize text into words or characters using NLTK's RegexpTokenizer, considering
    given special combinations as single tokens.

    :param txt: The corpus as a single string element.
    :param spec_tokens: A list of special tokens (e.g. ending, out-of-vocab).
    :param pattern: By default the corpus is tokenized on a word level (split by spaces).
                    Numbers are considered single tokens.

    >> note: The pattern for character level tokenization is '|.'
    """
    # special tokens come first in the alternation, so they match before the generic patterns
    pattern = "|".join(spec_tokens) + pattern
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(txt)
    return tokens

Applied to the corpus, this yields:

["Alice:", "Hi", "how", "are", "you", "guys", "?", "<END>", "Tom:", … ]

It turned out that my training data has a vocabulary of ~70,000 unique words. However, as many words appear only once or twice, I decided to replace such rare words by a “<UNK>” special token. This had the effect of reducing vocabulary to ~25,000 words, which leads to a smaller model that needs to be trained later.

from collections import Counter
from typing import List, Set, Union

# special token for rare words, defined here so that the snippet is self-contained
unknown_token = "<UNK>"

def get_infrequent_tokens(tokens: Union[List[str], str], min_count: int) -> Set[str]:
    """
    Identify tokens that appear at most a given number of times.

    :param tokens: When it is the raw text in a string, frequencies are counted on character level.
                   When it is the tokenized corpus as list, frequencies are counted on token level.
    :param min_count: Tokens that occur this many times or fewer are flagged.
    :return: Set of tokens that appear infrequently.
    """
    counts = Counter(tokens)
    infreq_tokens = {k for k, v in counts.items() if v <= min_count}
    return infreq_tokens

def mask_tokens(tokens: List[str], mask: Set[str]) -> List[str]:
    """
    Iterate through all tokens. Any token that is part of the set is replaced by the unknown token.

    :param tokens: The tokenized corpus.
    :param mask: Set of tokens that shall be masked in the corpus.
    :return: List of tokenized corpus after the masking operation.
    """
    return [unknown_token if t in mask else t for t in tokens]

infreq_tokens = get_infrequent_tokens(tokens, min_count=2)
tokens = mask_tokens(tokens, infreq_tokens)

The masked corpus then looks like this:

["Alice:", "Hi", "how", "are", "you", "<UNK>", "?", "<END>", "Tom:", … ]

4. Indexing

After tokenization, the next step is to convert the words and special tokens into numerical representations. Using a fixed vocabulary list, each word was indexed by its position. The encoded words were then prepared as PyTorch tensors.

import random

import torch

def encode(s: list, vocab: list) -> torch.Tensor:
    """
    Encode a list of tokens into a tensor of integers, given a fixed vocabulary.
    When a token is not found in the vocabulary, the special unknown token is assigned.
    When the training set did not use that special token, a random token is assigned.
    """
    # random.randint includes both bounds, hence len(vocab) - 1
    rand_token = random.randint(0, len(vocab) - 1)

    # unknown_token is the "<UNK>" special token introduced during tokenization
    token_to_idx = {tok: i for i, tok in enumerate(vocab)}
    enc = [token_to_idx.get(c, token_to_idx.get(unknown_token, rand_token)) for c in s]
    enc = torch.tensor(enc, dtype=torch.long)
    return enc

The encoded corpus is a tensor of token indices:

torch.tensor([8127, 115, 2363, 3, …, 14028])

As we need to evaluate the quality of our model against some unseen data, we split the tensor into two parts. And voila, we have our training and validation sets, ready to feed to the language model.
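The split itself can be sketched in a few lines. The helper name `train_valid_split` and the 10% validation share are assumptions for illustration; the article does not state the ratio that was actually used.

```python
def train_valid_split(data, valid_fraction=0.1):
    """Split a sequence into a leading training part and a trailing validation part."""
    n_valid = int(len(data) * valid_fraction)
    split = len(data) - n_valid
    return data[:split], data[split:]

encoded = list(range(20))  # stand-in for the encoded corpus
train_data, valid_data = train_valid_split(encoded)
print(len(train_data), len(valid_data))
```

Because the function only relies on `len` and slicing, it works the same way on a one-dimensional PyTorch tensor. Splitting at a single position keeps both parts as contiguous conversations instead of shuffling individual tokens.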

Image by author

5. Model Architecture

I decided to apply the GPT architecture, which builds on the Transformer introduced in the influential paper “Attention Is All You Need”. Since I tried to build a language generator and not a question-answer bot, the decoder-only architecture (the right side of the figure) was sufficient for this purpose.

“Attention Is All You Need” by A. Vaswani et al. (Retrieved from arXiv: 1706.03762)

In the following sections, I will break down each component of the GPT architecture, explaining its role and the underlying matrix operations. Starting off with the prepared training set, I will trace an exemplary context of 3 words through the model, until it leads to a prediction of the next token.

5.1. Model Objective

Before delving into technical specifics, it’s crucial to understand our model’s primary objective. In a decoder-only setup, our aim is to decode the structure of language to accurately predict the next token in a sequence, given the context of preceding tokens.

Image by author

As we feed our indexed token sequence into the model, it undergoes a series of matrix multiplications with various weight matrices. The output is a vector representing the probability of each token being the next in the sequence, based on the input context.

Model Evaluation:

Our model’s performance is evaluated against the training data, where the actual next token is known. The objective is to maximize the probability of correctly predicting this next token.

However, in machine learning, we often focus on the concept of “loss”, which quantifies the error or the likelihood of incorrect predictions. To calculate this, we compare the model’s output probabilities with the actual next token (using cross-entropy).
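For a single prediction, the cross-entropy loss reduces to the negative logarithm of the probability that the model assigned to the true next token. The following sketch uses made-up probabilities over a toy vocabulary of four tokens.

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-probability that the model assigned to the true next token."""
    return -math.log(probs[target_idx])

# toy vocabulary of four tokens; the true next token has index 2
confident = [0.05, 0.05, 0.85, 0.05]
unsure = [0.25, 0.25, 0.25, 0.25]

print(round(cross_entropy(confident, 2), 4))
print(round(cross_entropy(unsure, 2), 4))
```

The confident prediction yields a loss of about 0.16, while the uniform guess yields about 1.39, which equals ln(4). The more probability mass the model puts on the correct token, the closer the loss gets to zero.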


With an understanding of our current loss, we aim to minimize it through backpropagation and gradient-based weight updates. This process involves iteratively feeding token sequences into the model and adjusting the weight matrices to enhance performance.

In each figure, I will highlight in yellow the weight matrices that will be optimized during that procedure.

5.2. Output Embedding

Up to this point, each token in our sequence has been represented by an integer index. However, this simplistic form doesn’t reflect word relationships or similarities. To address this, we elevate our one-dimensional indices into higher-dimensional spaces through embedding.

Word-embeddings: The essence of a word is captured by an n-dimensional vector of floats.Positional-embeddings: These embeddings highlight the importance of a word’s position within a sentence, also represented as n-dimensional vectors of floats.

For each token, we look up its word-embedding and positional embedding and then sum them up element-wise. This results in the output embedding of each token in the context.

In the below example, the context consists of 3 tokens. At the end of the embedding process, each token is represented by an n-dimensional vector (where n is the embedding size, a tunable hyperparameter).

Image by author

PyTorch offers dedicated classes for such embeddings. Within our model class, we define the word and positional embeddings as follows (passing the matrix dimensions as arguments):

self.word_embedding = nn.Embedding(vocab_size, embed_size)
self.pos_embedding = nn.Embedding(block_size, embed_size)
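To illustrate how the two lookups are combined, here is a standalone sketch with assumed toy sizes and arbitrary token indices. It is meant to show the shapes involved rather than the exact code of the model class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# assumed toy sizes, far smaller than in the actual model
vocab_size, block_size, embed_size = 10, 3, 4

word_embedding = nn.Embedding(vocab_size, embed_size)
pos_embedding = nn.Embedding(block_size, embed_size)

context = torch.tensor([[7, 2, 5]])          # (batch=1, time=3) token indices
positions = torch.arange(context.shape[1])   # positions 0, 1, 2

tok_emb = word_embedding(context)            # shape (1, 3, 4)
pos_emb = pos_embedding(positions)           # shape (3, 4), broadcast over the batch
x = tok_emb + pos_emb                        # element-wise sum, shape (1, 3, 4)

print(x.shape)
```

Each of the three tokens ends up as one vector of length `embed_size`, which is the input to the attention layers described next.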

5.3. Self-Attention Head

While word embeddings provide a general sense of word similarity, the true meaning of a word often hinges on its surrounding context. For example, “bat” could refer to either an animal or a piece of sports equipment, depending on the sentence. This is where the self-attention mechanism, a key component of the GPT architecture, comes into play.

The self-attention mechanism focuses on three main concepts: Query (Q), Key (K), and Value (V).

- Query (Q): The query is essentially a representation of the current token for which the attention needs to be calculated. It’s like asking, “What should I, as the current token, pay attention to in the rest of the context?”
- Keys (K): Keys are representations of each token in the input sequence. They are paired with the Query to determine the attention scores. This comparison measures how much focus the query token should put on other tokens in the context. High scores mean that more attention should be paid.
- Value (V): Values are also representations of each token in the input sequence. However, their role is different: they carry the content that is combined into the output, weighted by the attention scores.

Image by author


In our example, each token of the context is already in embedded form, as n-dimensional vectors (e1, e2, e3). The self-attention head takes them as input and outputs a contextualized version of each of them, one at a time.

1. When evaluating the token “name”, a query vector q is obtained by multiplying its embedded vector e2 with the trainable matrix M_Q.
2. At the same time, key vectors (k1, k2, k3) are calculated for each token in the context, by multiplying each embedded vector (e1, e2, e3) with the trainable matrix M_K.
3. The value vectors (v1, v2, v3) are obtained in the same way, just multiplied by a different trainable matrix M_V.
4. The attention scores w are calculated as the dot product between the query vector and each key vector separately.
5. Finally, we stack all value vectors into a matrix and multiply it by the attention scores to obtain the contextualized vector for the token “name”.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class Head(nn.Module):
    """
    This module performs self-attention operations on the input tensor, producing
    an output tensor with the same time-steps but different channels.

    :param head_size: The size of the head in the multi-head attention mechanism.
    """
    def __init__(self, head_size):
        super().__init__()
        # embed_size is a hyperparameter defined elsewhere in the project
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, head_size, bias=False)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B, T, C = x.shape
        k = self.key(x)
        q = self.query(x)

        # compute attention scores, scaled by the square root of the head size
        wei = q @ k.transpose(-2, -1)
        wei = wei / math.sqrt(k.shape[-1])

        # avoid look-ahead: future positions get -inf and thus zero weight after softmax
        tril = torch.tril(torch.ones(T, T, device=x.device))
        wei = wei.masked_fill(tril == 0, float("-inf"))
        wei = F.softmax(wei, dim=-1)

        # weighted aggregation of the values
        v = self.value(x)
        out = wei @ v
        return out
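The same computation can be traced by hand on a tiny example. The sketch below uses plain Python lists and made-up query, key, and value vectors (it skips the projection matrices M_Q, M_K, and M_V), so it illustrates the mechanics rather than the trained model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """For each position t, attend only to positions 0..t (no look-ahead)."""
    head_size = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        visible_keys = keys[: t + 1]
        visible_values = values[: t + 1]
        # scaled dot-product scores against the keys seen so far
        scores = [dot(q, k) / math.sqrt(head_size) for k in visible_keys]
        weights = softmax(scores)
        # weighted sum of the visible value vectors
        out = [sum(w * v[d] for w, v in zip(weights, visible_values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# made-up vectors for a context of three tokens
queries = [[1, 0], [0, 1], [1, 1]]
keys = [[1, 0], [0, 1], [1, 1]]
values = [[1, 0], [0, 2], [3, 3]]

outputs = causal_attention(queries, keys, values)
for row in outputs:
    print([round(x, 3) for x in row])
```

The first token can only attend to itself, so its output equals its own value vector. Slicing the keys and values up to position t has the same effect as filling future positions with negative infinity before the softmax.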

5.4. Masked Multi-Head Attention

Language is complex, and capturing all its nuances isn’t straightforward. A single set of attention calculations often isn’t enough to catch the subtleties of how words work together. That’s where the idea of multi-head attention in the GPT model comes in handy.

Think of multi-head attention as having several pairs of eyes looking at the data in different ways, each noticing unique details. These separate observations are then put together into one big picture. To keep this big picture manageable and compatible with the rest of our model, we use a linear layer (trainable weights) to compress it back to our original embedding size.

Finally, to make sure our model doesn’t just memorize the training data but also gets good at making predictions on new text, we use a dropout layer. This randomly turns off parts of the data during training, which helps the model become more adaptable.

Image by author

class MultiHeadAttention(nn.Module):
    """
    This class contains multiple `Head` objects, which perform self-attention
    operations in parallel.
    """
    def __init__(self):
        super().__init__()
        # embed_size, n_heads and dropout are hyperparameters defined elsewhere in the project
        head_size = embed_size // n_heads
        heads_list = [Head(head_size) for _ in range(n_heads)]

        self.heads = nn.ModuleList(heads_list)
        self.linear = nn.Linear(n_heads * head_size, embed_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # run all heads on the same input and concatenate along the channel dimension
        heads_list = [h(x) for h in self.heads]
        out = torch.cat(heads_list, dim=-1)
        out = self.linear(out)
        out = self.dropout(out)
        return out

5.5. Feed Forward

The multi-head attention layer initially captures the contextual relationships within the sequence. More depth is added to the network via two consecutive linear layers, which collectively constitute the feed-forward neural network.

Image by author

In the initial linear layer, we increase the dimensionality (by a factor of 4 in our case), which effectively broadens the network’s capacity to learn and represent more complex features. A ReLU function is applied to each element of the resulting matrix, which enables non-linear patterns to be recognized.

Subsequently, the second linear layer acts as a compressor, reducing the expanded dimensions back to the original shape (block-size x embedding-size). A dropout layer concludes the process, randomly deactivating elements of the matrix, for the sake of model generalization.

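class FeedFoward(nn.Module):
    """
    This module passes the input tensor through a series of linear transformations
    and non-linear activations.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),  # expand by a factor of 4
            nn.ReLU(),                              # element-wise non-linearity
            nn.Linear(4 * embed_size, embed_size),  # compress back to the embedding size
            nn.Dropout(dropout),                    # randomly deactivate elements during training
        )

    def forward(self, x):
        return self.net(x)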

5.6. Add & Norm

Now we link together the multi-head attention and feed-forward component, by introducing two more crucial elements:

- Residual Connections (Add): These connections perform an element-wise addition of a layer’s output to its unaltered input. During training, the model adjusts the emphasis on layer transformations based on their usefulness. If a transformation is deemed nonessential, its weights and consequently its layer output tend towards zero. In this case, at least the unaltered input is passed through the residual connection. This technique helps mitigate the vanishing gradient problem.
- Layer Normalization (Norm): This method normalizes each embedded vector in the context by subtracting its mean and dividing by its standard deviation. This process also ensures that the gradients during backpropagation neither explode nor vanish.

Image by author
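Both operations are simple enough to write out by hand. The sketch below normalizes a single made-up vector and adds a sublayer output back to its input. It leaves out the learnable scale and shift parameters that PyTorch’s `nn.LayerNorm` applies by default, and the helper names are chosen for this illustration only.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one embedded vector to zero mean and (almost exactly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    # eps guards against division by zero for constant vectors
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def add_residual(x, sublayer_out):
    """Element-wise addition of a sublayer's output to its unaltered input."""
    return [a + b for a, b in zip(x, sublayer_out)]

x = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(x)
print([round(v, 4) for v in normed])

# a sublayer whose output has collapsed to zero still lets the input pass through
print(add_residual(x, [0.0, 0.0, 0.0, 0.0]))
```

The second print shows the point made above: if a transformation contributes nothing, the residual connection still carries the original signal forward.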

The chain of multi-head attention and feed-forward layers, linked with “Add & Norm” is consolidated into a block. This modular design allows us to form a sequence of blocks. The number of these blocks is a hyper-parameter, which determines the depth of the model architecture.

class Block(nn.Module):
    """
    This module contains a single transformer block, which consists of multi-head
    self-attention followed by feed-forward neural networks.
    """
    def __init__(self):
        super().__init__()
        self.sa = MultiHeadAttention()
        self.ffwd = FeedFoward()
        self.ln1 = nn.LayerNorm(embed_size)
        self.ln2 = nn.LayerNorm(embed_size)

    def forward(self, x):
        # normalize, transform, then add the result to the unaltered input
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

5.7. Softmax

Upon traversing multiple block components, we obtain a matrix of dimensions (block-size x embed-size). To reshape this matrix to the required dimensions (block-size x vocab size), we pass it through a final linear layer. This shape represents an entry for each word in the vocabulary at each position in the context.

Finally, we apply the soft-max transformation to these values, converting them into probabilities. We have successfully obtained a probability distribution for the next token at every position in the context.
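The softmax step for a single position can be sketched as follows. The vocabulary and the scores are made up for this illustration; both greedy selection and random sampling are shown as possible ways to pick the next token from the resulting distribution.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to one."""
    m = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["hi", "how", "are", "you", "<END>"]   # toy vocabulary
logits = [0.5, 2.0, 0.1, 1.0, -1.0]            # made-up scores for one position

probs = softmax(logits)

# greedy choice: the most probable token
greedy = vocab[probs.index(max(probs))]

# sampling: draw a token according to its probability
random.seed(0)
sampled = random.choices(vocab, weights=probs, k=1)[0]

print([round(p, 3) for p in probs], greedy, sampled)
```

Sampling instead of always taking the most probable token keeps generated text from repeating the same phrases over and over.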

6. Model Training

To train the language model, I selected token sequences from random positions within my training data. Given the fast-paced nature of WhatsApp conversations, I determined a context length of 32 words to be sufficient. Thus, I chose random 32-word chunks as the context input and used the corresponding vectors, shifted by one word, as the targets for comparison.
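The relation between context and target can be sketched with a small sampling function. The name `sample_batch` and its parameters are assumptions for illustration; the project uses its own `get_batch` helper, whose implementation is not shown in the article.

```python
import random

def sample_batch(data, block_size, batch_size):
    """Draw random contexts and their targets, which are the contexts shifted by one token."""
    xs, ys = [], []
    for _ in range(batch_size):
        # leave room for the target, which reaches one token further than the context
        start = random.randrange(len(data) - block_size)
        xs.append(data[start : start + block_size])
        ys.append(data[start + 1 : start + block_size + 1])
    return xs, ys

data = list(range(100))   # stand-in for the encoded training corpus
random.seed(0)
x_batch, y_batch = sample_batch(data, block_size=32, batch_size=4)

print(x_batch[0][:5], y_batch[0][:5])
```

Every position in a context thus has a known next token, so a single 32-word chunk provides 32 training examples at once.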

The training process was looping through the following steps:

1. Sample multiple batches of context.
2. Feed these samples into the model to calculate the current loss.
3. Apply backpropagation based on the current loss and model weights.
4. Evaluate the loss more comprehensively every 500th iteration.

Once all other model hyperparameters (like embedding size, number of self-attention heads, etc.) were fixed, I finalized a model with 2.5 million parameters. Given my limitations on input data size and computational resources, I found this to be the optimal setup for me.

The training process took approximately 12 hours for 10,000 iterations. One can see that training could have been stopped earlier, as the spread between the loss on the validation and training sets widens.

Image by author

import json

import torch

from config import eval_interval, learn_rate, max_iters
from src.model import GPTLanguageModel
from src.utils import current_time, estimate_loss, get_batch

def model_training(update: bool) -> None:
    """
    Trains or updates a GPTLanguageModel using pre-loaded data.

    This function either initializes a new model or loads an existing model based
    on the `update` parameter. It then trains the model using the AdamW optimizer
    on the training and validation data sets. Finally the trained model is saved.

    :param update: Boolean flag to indicate whether to update an existing model.
    """
    # LOAD DATA ----------------------------------------------------------------

    train_data = torch.load("assets/output/train.pt")
    valid_data = torch.load("assets/output/valid.pt")

    with open("assets/output/vocab.txt", "r", encoding="utf-8") as f:
        vocab = json.loads(f.read())

    # INITIALIZE / LOAD MODEL --------------------------------------------------

    if update:
        try:
            model = torch.load("assets/models/model.pt")
            print("Loaded existing model to continue training.")
        except FileNotFoundError:
            print("No existing model found. Initializing a new model.")
            model = GPTLanguageModel(vocab_size=len(vocab))
    else:
        print("Initializing a new model.")
        model = GPTLanguageModel(vocab_size=len(vocab))

    # initialize optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learn_rate)

    # number of model parameters
    n_params = sum(p.numel() for p in model.parameters())
    print(f"Parameters to be optimized: {n_params}\n")

    # MODEL TRAINING -----------------------------------------------------------

    for i in range(max_iters):

        # evaluate the loss on train and valid sets every 'eval_interval' steps
        if i % eval_interval == 0 or i == max_iters - 1:
            train_loss = estimate_loss(model, train_data)
            valid_loss = estimate_loss(model, valid_data)

            time = current_time()
            print(f"{time} | step {i}: train loss {train_loss:.4f}, valid loss {valid_loss:.4f}")

        # sample batch of data
        x_batch, y_batch = get_batch(train_data)

        # evaluate the loss
        logits, loss = model(x_batch, y_batch)

        # backpropagate the loss and update the weights
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    torch.save(model, "assets/models/model.pt")
    print("Model saved")

7. Chat-Mode

To interact with the trained model, I created a function that allows selecting a contact name via a dropdown menu and inputting a message for the model to respond to. The parameter “n_chats” determines the number of responses the model generates at once. The model concludes a generated message when it predicts the <END> token as the next token.

import json
import random

import torch
from prompt_toolkit import prompt
from prompt_toolkit.completion import WordCompleter

from config import end_token, n_chats
from src.utils import custom_tokenizer, decode, encode, print_delayed

def conversation() -> None:
    """
    Emulates chat conversations by sampling from a pre-trained GPTLanguageModel.

    This function loads a trained GPTLanguageModel along with vocabulary and
    the list of special tokens. It then enters into a loop where the user specifies
    a contact. Given this input, the model generates a sample response. The conversation
    continues until the user inputs the end token.
    """
    with open("assets/output/vocab.txt", "r", encoding="utf-8") as f:
        vocab = json.loads(f.read())

    with open("assets/output/contacts.txt", "r", encoding="utf-8") as f:
        contacts = json.loads(f.read())

    spec_tokens = contacts + [end_token]
    model = torch.load("assets/models/model.pt")
    completer = WordCompleter(spec_tokens, ignore_case=True)

    input = prompt("message >> ", completer=completer, default="")
    output = torch.tensor([], dtype=torch.long)

    while input != end_token:
        for _ in range(n_chats):

            # append the new input to everything generated so far
            add_tokens = custom_tokenizer(input, spec_tokens)
            add_context = encode(add_tokens, vocab)
            context = torch.cat((output, add_context)).unsqueeze(1).T

            n0 = len(output)
            output = model.generate(context, vocab)
            n1 = len(output)

            # print only the newly generated tokens
            print_delayed(decode(output[n0-n1:], vocab))
            input = random.choice(contacts)

        input = prompt("\nresponse >> ", completer=completer, default="")
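The stopping rule mentioned above, where a message ends as soon as the end token is predicted, can be sketched with a toy stand-in for the trained model. The lookup table, the function name `generate_message`, and the token limit are assumptions for illustration and do not reflect the project’s `generate` method.

```python
import random

END_TOKEN = "<END>"

# toy stand-in for the trained model: next-token probabilities given the last token
toy_model = {
    "Alice:": {"hi": 1.0},
    "hi": {"guys": 0.7, END_TOKEN: 0.3},
    "guys": {END_TOKEN: 1.0},
}

def generate_message(start, model, max_tokens=20):
    """Sample token by token until the end token appears or the limit is reached."""
    tokens = [start]
    for _ in range(max_tokens):
        dist = model[tokens[-1]]
        next_token = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
        tokens.append(next_token)
        if next_token == END_TOKEN:
            break
    return tokens

random.seed(1)
message = generate_message("Alice:", toy_model)
print(" ".join(message))
```

The upper limit on the number of tokens is a safeguard for cases in which the end token is never sampled.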


Due to the privacy of my personal chats, I am unable to present example prompts and conversations here.

Nonetheless, you can expect a model of this scale to successfully learn the general structure of sentences, producing meaningful outputs in terms of word order. In my case, it also picked up context for certain topics that were prominent in the training data. For instance, as my personal chats often revolve around tennis, the names of tennis players and tennis-related words were typically output together.

However, when evaluating the coherence of the generated sentences, I concede that the results did not quite meet my already modest expectations. But of course, I could also blame my friends for chatting too much nonsense, limiting the model’s ability to learn something useful…

To show at least some example output at the end, you can see how the dummy model performs on 200 trained dummy messages 😉

Image by author

Build a Language Model on your WhatsApp Chats was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
