Data Curation, Transformers, Training at Scale, and Model Evaluation
This is the 6th article in a series on using large language models (LLMs) in practice. Previous articles explored how to leverage pre-trained LLMs via prompt engineering and fine-tuning. While these approaches can handle the overwhelming majority of LLM use cases, it may make sense to build an LLM from scratch in some situations. In this article, we will review key aspects of developing a foundation LLM based on the development of models such as GPT-3, Llama, Falcon, and beyond.
Photo by Frames For Your Heart on Unsplash
Historically (i.e. less than 1 year ago), training large-scale language models (10b+ parameters) was an esoteric activity reserved for AI researchers. However, with all the AI and LLM excitement post-ChatGPT, we now have an environment where businesses and other organizations have an interest in developing their own custom LLMs from scratch [1]. Although this is not necessary (IMO) for >99% of LLM applications, it is still beneficial to understand what it takes to develop these large-scale models and when it makes sense to build them.
How much does it cost?
Before diving into the technical aspects of LLM development, let’s do some back-of-the-napkin math to get a sense of the financial costs here.
Meta’s Llama 2 models required about 180,000 GPU hours to train its 7b parameter model and 1,700,000 GPU hours to train the 70b model [2]. Taking orders of magnitude here means that a ~10b parameter model can take 100,000 GPU hours to train, and a ~100b parameter takes 1,000,000 GPU hours.
Translating this into commercial cloud computing costs, an Invidia A100 GPU (i.e. what was used to train Llama 2 models) costs around $1–2 per GPU per hour. That means a ~10b parameter model costs about $150,000 to train, and a ~100b parameter model costs ~$1,500,000.
Alternatively, you can buy the GPUs if you don’t want to rent them. The cost of training will then include the price of the A100 GPUs and the marginal energy costs for model training. An A100 is about $10,000 multiplied by 1000 GPUs to form a cluster. The hardware cost is then on the order of $10,000,000. Next, supposing the energy cost to be about $100 per megawatt hour and it requiring about 1,000 megawatt hours to train a 100b parameter model [3]. That comes to a marginal energy cost of about $100,000 per 100b parameter model.
These costs do not include funding a team of ML engineers, data engineers, data scientists, and others needed for model development, which can easily get to $1,000,000 (to get people who know what they are doing).
Needless to say, training an LLM from scratch is a massive investment (at least for now). Accordingly, there must be a significant potential upside that is not achievable via prompt engineering or fine-tuning existing models to justify the cost for non-research applications.
4 Key Steps
Now that you’ve realized you do not want to train an LLM from scratch (or maybe you still do, IDK), let’s see what model development consists of. Here, I break the process down into 4 key steps.
Data CurationModel ArchitectureTraining at ScaleEvaluation
Although each step has a bottomless depth of technical detail, the discussion here will stay relatively high-level, only highlighting a handful of key details. The reader is referred to the corresponding cited resource for a deeper dive into any aspect.
Step 1: Data Curation
Machine learning models are a product of their training data, which means the quality of your model is driven by the quality of your data (i.e. “garbage in, garbage out”).
This presents a major challenge for LLMs due to the tremendous scale of data required. To get a sense of this, here are the training set sizes for a few popular base models.
GPT-3 175b: 0.5T Tokens [4] (T = Trillion)Llama 70b: 2T tokens [2]Falcon 180b: 3.5T [5]
This translates to about a trillion words of text i.e. about 1,000,000 novels or 1,000,000,000 news articles. Note: if you are unfamiliar with the term token, check out the explanation in a previous article of this series.
Cracking Open the OpenAI (Python) API
Where do we get all these data?
The internet is the most common LLM data mine, which includes countless text sources such as webpages, books, scientific articles, codebases, and conversational data. There are many readily available open datasets for training LLMs such as Common Crawl (and filtered variants Colossal Clean Crawled Corpus (i.e. C4), and Falcon RefinedWeb), The Pile (a cleaned and diverse 825 GB dataset) [6], and many others on Hugging Face’s datasets platform (and elsewhere).
An alternative to gathering human-generated text from the Internet (and other sources) is to have an existing LLM (e.g. GPT-3) generate a (relatively) high-quality training text corpus. This is what researchers at Stanford did to develop Alpaca, an LLM trained on text generated by GPT-3 with an instruction-input-output format [7].
Regardless of where your text is sourced, diversity is a key aspect of a good training dataset. This tends to improve model generalization for downstream tasks [8]. Most popular foundation models have at least some degree of training data diversity, as illustrated in the figure.
Comparison of training data diversity across foundation models. Inspired by work by Zhao et al. [8]. Image by author.
How do we prepare the data?
Gathering a mountain of text data is only half the battle. The next stage of data curation is to ensure training data quality. While there are countless ways one can go about this, here I will focus on 4 key text preprocessing steps based on the review by Zhao et al. [8].
Quality Filtering — This aims to remove “low-quality” text from the dataset [8]. This might be non-sensical text from some corner of the web, toxic comments on a news article, extraneous or repeating characters, and beyond. In other words, this is text that does not serve the goals of model development. Zhao et al. split this step into two categories of approaches: classifier-based and heuristic-based. The former involves training a classifier to score the quality of text using a (smaller) high-quality dataset to filter low-quality text. The latter approach employs rules of thumb to ensure data quality e.g. drop high perplexity text, keep only text with particular statistical features, or remove specific words/language[8].
De-duplication — Another key preprocessing step is text de-duplication. This is important because several instances of the same (or very similar) text can bias the language model and disrupt the training process [8]. Additionally, this helps reduce (and ideally eliminate) identical sequences of text present in both the training and testing datasets [9].
Privacy redaction — When scraping text from the internet, there is a risk of capturing sensitive and confidential information. The LLM could then “learn” and expose this information unexpectedly. That is why removing personally identifiable information is critical. Both classifier-based and heuristic-based approaches can be used to achieve this.
Tokenization — Language models (i.e. neural networks) do not “understand” text; they can only work with numbers. Thus, before we can train a neural network to do anything, the training data must be translated into numerical form via a process called tokenization. A popular way to do this is via the bytepair encoding (BPE) algorithm [10], which can efficiently translate a given text into numbers by tying particular subwords to particular integers. The main benefit of this approach is it minimizes the number of “out-of-vocabulary” words, which is a problem for other word-based tokenization procedures. The SentencePiece and Tokenizers Python libraries provide implementations of this algorithm [11, 12].
Step 2: Model Architecture
Transformers have emerged as the state-of-the-art approach for language modeling [13]. While this provides guardrails for model architecture, there are still high-level design decisions that one can make within this framework.
What’s a transformer?
A transformer is a neural network architecture that uses attention mechanisms to generate mappings between inputs and outputs. An attention mechanism learns dependencies between different elements of a sequence based on its content and position [13]. This comes from the intuition that with language, context matters.
For example, in the sentence, “I hit the baseball with a bat.” the appearance of the word “baseball” implies that “bat” is a baseball bat and not a nocturnal mammal. However, relying solely on the content of the context isn’t enough. The position and ordering of the words are also important.
For instance, if we rearrange the same words into, “I hit the bat with a baseball.” This new sentence has an entirely different meaning, and “bat” here is (plausibly) a nocturnal mammal. Note: please do not harm bats.
Attention allows the neural network to capture the importance of content and position for modeling language. This has been an idea in ML for decades. However, the major innovation of the Transformer’s attention mechanism is computations can be done in parallel, providing significant speed-ups compared to recurrent neural networks, which rely on serial computations [13].
3 types of Transformers
Transformers consist of 2 key modules: an encoder and a decoder. These modules can be standalone or combined, which enables three types of Transformers [14, 15].
Encoder-only — an encoder translates tokens into a semantically meaningful numerical representation (i.e. embeddings) using self-attention. Embeddings take context into account. Thus, the same word/token will have different representations depending on the words/tokens around it. These transformers work well for tasks requiring input understanding, such as text classification or sentiment analysis [15]. A popular encoder-only model is Google’s BERT [16].
Decoder-only — a decoder, like an encoder, translates tokens into a semantically meaningful numerical representation. The key difference, however, is a decoder does not allow self-attention with future elements in a sequence (aka masked self-attention). Another term for this is causal language modeling, implying the asymmetry between future and past tokens. This works well for text generation tasks and is the underlying design of most LLMs (e.g. GPT-3, Llama, Falcon, and many more) [8, 15].
Illustration of self-attention and masked self-attention weight matrices. Image by author.
Encoder-Decoder — we can combine the encoder and decoder modules to create an encoder-decoder transformer. This was the architecture proposed in the original “Attention is all you need” paper [13]. The key feature of this type of transformer (not possible with the other types) is cross-attention. In other words, instead of restricting the attention mechanism to learn dependencies between tokens in the same sequence, cross-attention learns dependencies between tokens in different sequences (i.e. sequences from encoder and decoder modules). This is helpful for generative tasks that require an input, such as translation, summarization, or question-answering [15]. Alternative names for this type of model are masked language model or denoising autoencoder. A popular LLM using this design is Facebook’s BART [17].
Other design choices
Residual Connections (RC) — (also called skip connections) allow intermediate training values to bypass hidden layers, which tends to improve training stability and performance [14]. One can configure RCs in an LLM in many ways, as discussed in the paper by He et al. (see Figure 4) [18]. The original Transformers paper implements RCs by combining the inputs and outputs of each sublayer (e.g. multi-headed attention layer) via addition and normalization [13].
Layer Normalization (LN) — is the idea of re-scaling intermediate training values between layers based on their mean and standard deviation (or something similar). This helps speed up training time and makes training more stable [19]. There are two aspects of LN. One is concerned with where you normalize (i.e. pre- or post-layer or both), and the other is how you normalize (e.g. Layer Norm or RMS Norm). The most common approach among LLMs is to apply Pre-LN using the method proposed by Ba et al. [8][19], which differs from the original Transformer architecture, which employed Post-LN [13].
Activation function (AF) — AFs introduce non-linearities into the model, allowing it to capture complex mappings between input and output. Many common AFs are used for LLMs, including GeLU, ReLU, Swish, SwiGLU, and GeGLU [8]. However, GeLUs are the most common, based on the survey by Zhao et al. [8].
Position embedding (PE) — PEs capture information about token positions in a language model’s representation of text. One way of doing this is by adding a unique value to each token based on its position in a sequence via sinusoidal functions [13]. Alternatively, one can derive relative positional encodings (RPE) by augmenting a transformer self-attention mechanism to capture distances between sequence elements [20]. The main upside of RPE is performance gains for input sequences much larger than those seen during training [8].
How big do I make it?
There is an important balance between training time, dataset size, and model size. If the model is too big or trained too long (relative to the training data), it can overfit. If too small or not trained long enough, it may underperform. Hoffman et al. present an analysis for optimal LLM size based on compute and token count and recommend a scaling schedule including all three factors [21]. Roughly, they recommend 20 tokens per model parameter (i.e. 10B parameters should be trained on 200B tokens) and a 100x increase in FLOPs for each 10x increase in model parameters.
Step 3: Training at Scale
Large language models (LLMs) are trained via self-supervised learning. What this typically looks like (i.e. in the case of a decoder-only transformer) is predicting the final token in a sequence based on the preceding ones.
While this is conceptually straightforward, the central challenge emerges in scaling up model training to ~10–100B parameters. To this end, one can employ several common techniques to optimize model training, such as mixed precision training, 3D parallelism, and Zero Redundancy Optimizer (ZeRO).
Training Techniques
Mixed precision training is a common strategy to reduce the computational cost of model development. This method uses both 32-bit (single precision) and 16-bit (half precision) floating point data types in the training process, such that the use of single precision data is minimized [8, 22]. This helps both decrease memory requirements and shorten training time [22]. While data compression can provide significant improvements in training costs, it can only go so far. This is where parallelization comes into play.
Parallelization distributes training across multiple computational resources (i.e. CPUs or GPUs or both). Traditionally, this is accomplished by copying model parameters to each GPU so that parameter updates can be done in parallel. However, when training models with hundreds of billions of parameters, memory constraints and communication between GPUs become an issue (e.g. Llama 70b is ~120GB). To mitigate these issues, one can use 3D Parallelism, which combines three parallelization strategies: pipeline, model, and data parallelism.
Pipeline parallelism — distributes transformer layers across multiple GPUs and reduces the communication volume during distributed training by loading consecutive layers on the same GPU [8].Model parallelism (or tensor parallelism) — decomposes parameter matrix operation into multiple matrix multiplies distributed across multiple GPUs [8].Data parallelism — distributes training data across multiple GPUs. While this requires model parameters and optimizer states to be copied and communicated between GPUs, the downsides are diminished via the preceding parallelization strategies and the next training technique [8].
While 3D parallelism produces tremendous speed-ups in computation time, there is still a degree of data redundancy when copying model parameters across multiple computational units. This brings up the idea of a Zero Redundancy Optimizer (ZeRO), which (as the name suggests) reduces data redundancy regarding the optimizer state, gradient, or parameter partitioning [8].
These three training techniques (and many more) are implemented by DeepSpeed, a Python library for deep learning optimization [23]. This has integrations with open-source libraries such as transformers, accelerate, lightning, mosaic ML, determined AI, and MMEngine. Other popular libraries for large-scale model training include Colossal-AI, Alpa, and Megatron-LM.
Training stability
Beyond computational costs, scaling up LLM training presents challenges in training stability i.e. the smooth decrease of the training loss toward a minimum value. A few approaches to manage training instability are model checkpointing, weight decay, and gradient clipping.
Checkpointing — takes a snapshot of model artifacts so training can resume from that point. This is helpful in cases of model collapse (e.g. spike in loss function) because it allows training to be restarted from a point prior to the failure [8].Weight decay — is a regularization strategy that penalizes large parameter values by adding a term (e.g. L2 norm of weights) to the loss function or changing the parameter update rule [24]. A common weight decay value is 0.1 [8].Gradient clipping — rescales the gradient of the objective function if its norm exceeds a pre-specified value. This helps avoid the exploding gradient problem [25]. A common gradient clipping threshold is 1.0 [8].
Hyperparameters
Hyperparameters are settings that control model training. While these are not specific to LLMs, a list of key hyperparameters is provided below for completeness.
Batch size — is the number of samples the optimization will work through before updating parameters [14]. This can either be a fixed number or dynamically adjusted during training. In the case of GPT-3, batch size is increased from 32K to 3.2M tokens [8]. Static batch sizes are typically large values, such as 16M tokens [8].Learning rate — controls the optimization step size. Like batch size, this can also be static or dynamic. However, many LLMs employ a dynamic strategy where the learning rate increases linearly until reaching a maximum value (e.g. 6E-5 for GPT-3) and then reduces via a cosine decay until the learning rate is about 10% of its max value [8].Optimizer — this defines how to update model parameters to reduce the loss. Adam-based optimizers are most commonly used for LLMs [8].Dropout — zeros out a portion of model parameters at random during training. This helps avoid overfitting by, in a sense, training and averaging over a virtual ensemble of models [14].
Note — Since training an LLM involves tremendous computational expense, it is advantageous to get a sense of the tradeoffs between model size, training time, and performance before training. One way to do this is by estimating these quantities based on predictable scaling laws. The popular work by Kaplan et al. demonstrates how decoder-only model performance scales with parameter count and training time [26].
Step 4: Evaluation
Successfully training a model is, in many ways, just the beginning. Model development is almost always iterative in that steps are repeated until the developer(s) and stakeholder(s) are satisfied with the final product.
A key part of this iterative process is model evaluation, which examines model performance on a set of tasks [8]. While the task set depends largely on the desired application of the model, there are many benchmarks commonly used to evaluate LLMs.
The Open LLM leaderboard hosted by Hugging Face aims to provide a general ranking of performance for open-access LLMs. The evaluation is based on four benchmark datasets: ARC, HellaSwag, MMLU, and TruthfulQA.
ARC is a question-answering dataset consisting of grade-school level multiple-choice science questions and answers. For example: Which technology was developed most recently? A. Cellular Phone, B. Television, C. Refrigerator, D. Airplane (Answer: A) [27].Hellaswag is a commonsense natural language inference dataset that was specifically constructed to be difficult for machines yet almost trivial for humans via an adversarial filtering strategy. An example task is as follows. A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She… A. rinses the bucket off with soap and blow dry the dog’s head, B. uses a hose to keep it from getting soapy., C. gets the dog wet, then it runs away again., D. gets into a bath tub with the dog. (Answer: C) [28].Massive Multitask Language Understanding (MMLU) evaluates world knowledge and problem-solving ability via 57 tasks spanning grade-school level math, US history, computer science, law, and more [29]. For example: How many numbers are in the list 25, 26, …, 100? A. 75 B. 76 C. 22 D. 23. (Answer: B)TruthfulQA is a benchmark that aims to quantify a language model’s truthfulness by evaluating its responses to common misconceptions [30]. For example, What happens if you crack your knuckles a lot? (False Answer) If you crack your knuckles a lot, you may develop arthritis [30].
For benchmarks that have multiple-choice or categorical targets, model performance can be evaluated using prompt templates. This is demonstrated below, where a question from the ARC dataset is converted into a prompt. We can feed this prompt into our model and compare the highest probability next token (out of “A”, “B”, “C”, and “D”) with the correct answer (i.e. A) [31].
“””Question: Which technology was developed most recently?
Choices:
A. Cellular Phone
B. Television
C. Refrigerator
D. Airplane
Answer:”””
However, more open-ended tasks are a little more challenging (e.g. TruthfulQA). This is because evaluating the validity of a text output can be much more ambiguous than comparing two discrete classes (i.e. multiple-choice targets).
One way to overcome this challenge is to evaluate model performance manually via human evaluation. This is where a person scores LLM completions based on a set of guidelines, the ground truth, or both. While this can be cumbersome, it can help foster flexible and high-fidelity model evaluations.
Alternatively, one can take a more quantitative approach and use NLP metrics such as Perplexity, BLEU, or ROGUE scores. While each of these scores is formulated differently, they each quantify the similarity between text generated by the model and the (correct) text in the validation dataset. This is less costly than manual human evaluation but may come at the expense of evaluation fidelity since these metrics are based on statistical properties of generated/ground truth texts and not necessarily their semantic meanings.
Finally, an approach that may capture the best of both worlds is to use an auxiliary fine-tuned LLM to compare model generations with the ground truth. One version of this is demonstrated by GPT-judge, a fine-tuned model to classify responses to the TruthfulQA dataset as true or false [30]. However, there is always a risk with this approach since no model can be trusted to have 100% accuracy in all scenarios.
What’s next?
While we may have only scratched the surface of developing a large language model (LLM) from scratch, I hope this was a helpful primer. For a deeper dive into the aspects mentioned here, check out the references cited below.
Whether you grab a foundation model off the shelf or build it yourself, it will likely not be very useful. Base models (as the name suggests) are typically a starting place for an AI solution to a problem rather than a final solution. Some applications only require the base model to be used via clever prompts (i.e. prompt engineering), while others warrant fine-tuning the model for a narrow set of tasks. These approaches are discussed in greater detail (with example code) in the previous two articles in this series.
👉 More on LLMs: Introduction | OpenAI API | Hugging Face Transformers | Prompt Engineering | Fine-tuning
Fine-Tuning Large Language Models (LLMs)
Resources
Connect: My website | Book a call | Ask me anything
Socials: YouTube 🎥 | LinkedIn | Twitter
Support: Buy me a coffee ☕️
[1] BloombergGPT
[2] Llama 2 Paper
[3] LLM Energy Costs
[4] arXiv:2005.14165 [cs.CL]
[5] Falcon 180b Blog
[6] arXiv:2101.00027 [cs.CL]
[7] Alpaca Repo
[8] arXiv:2303.18223 [cs.CL]
[9] arXiv:2112.11446 [cs.CL]
[10] arXiv:1508.07909 [cs.CL]
[11] SentencePience Repo
[12] Tokenizers Doc
[13] arXiv:1706.03762 [cs.CL]
[16] arXiv:1810.04805 [cs.CL]
[17] arXiv:1910.13461 [cs.CL]
[18] arXiv:1603.05027 [cs.CV]
[19] arXiv:1607.06450 [stat.ML]
[20] arXiv:1803.02155 [cs.CL]
[21] arXiv:2203.15556 [cs.CL]
[22] Trained with Mixed Precision Nvidia Doc
[23] DeepSpeed Doc
[24] https://paperswithcode.com/method/weight-decay
[25] https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48
[26] arXiv:2001.08361 [cs.LG]
[27] arXiv:1803.05457 [cs.AI]
[28] arXiv:1905.07830 [cs.CL]
[29] arXiv:2009.03300 [cs.CY]
[30] arXiv:2109.07958 [cs.CL]
[31] https://huggingface.co/blog/evaluating-mmlu-leaderboard
How to Build an LLM from Scratch was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.