How code LLMs progressed from RNNs to Transformers
Introduction
Recent years have seen a remarkable evolution of language models with the introduction of Transformers, which has revolutionized the way we perform daily tasks like writing emails, creating documentation, searching the web, and even the way we code. With researchers applying Large Language Models to code intelligence tasks, a new field of Neural Code Intelligence has emerged. This domain aims to improve programming efficiency and minimize human errors in the software industry by solving tasks like code summarization, generation, and translation.
With the latest release of Code Llama, Meta AI's state-of-the-art model for code generation and understanding, this article looks back at the evolution of Large Language Models (LLMs) for code, from RNNs to Transformers.
Fig-1: A timeline for Large Language Models For Code. Image by Author.
Code2Vec, 2018
This was one of the first attempts to have a language model understand code. Code2Vec aimed at representing code snippets as embeddings. These embeddings capture semantic and structural information from the code, making them useful for various software engineering tasks such as code classification, retrieval, and understanding.
The model tries to predict a method's name from its code snippet by encoding the snippet's tokens and AST (Abstract Syntax Tree) paths, then applying neural attention to aggregate them into a fixed-length vector representation.
Fig-2: Code2Vec Model Architecture: the program is first decomposed into a bag of contexts (tokens and AST paths), which pass through a fully connected layer and an attention layer to generate the code vector. Image inspired by the original Code2Vec paper by Uri Alon et al.
Training Set: 14M Java Program Examples
Model Architecture: RNN + Feed-Forward Network
Novelty:
Path-based Attention Model: The authors propose a novel neural network architecture that uses syntactic paths in the Abstract Syntax Tree (AST) of a code snippet as input features. The model learns to assign different attention weights to each path and to aggregate them into a single code vector. The code vector can then be used to predict the label distribution for the snippet, or to measure similarity and analogy between snippets (a minimal sketch of this aggregation follows below).
You can play with the model here.
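To make the path-attention idea concrete, here is a minimal PyTorch sketch of the aggregation step: each path context is passed through a fully connected layer, scored against a learned attention vector, and the weighted sum becomes the fixed-length code vector. The dimensions and the random inputs are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of Code2Vec-style path attention (not the authors' code).
import torch
import torch.nn as nn

class PathAttention(nn.Module):
    def __init__(self, context_dim: int, code_dim: int):
        super().__init__()
        # Fully connected layer combines each (token, path, token) context
        self.fc = nn.Linear(context_dim, code_dim)
        # Learned global attention vector scores each combined context
        self.attention = nn.Parameter(torch.randn(code_dim))

    def forward(self, contexts: torch.Tensor) -> torch.Tensor:
        # contexts: (num_contexts, context_dim) -> combined: (num_contexts, code_dim)
        combined = torch.tanh(self.fc(contexts))
        # Attention weights over all path contexts of the snippet
        weights = torch.softmax(combined @ self.attention, dim=0)
        # Weighted sum gives a single fixed-length code vector
        return (weights.unsqueeze(-1) * combined).sum(dim=0)

model = PathAttention(context_dim=384, code_dim=128)
code_vector = model(torch.randn(200, 384))  # 200 extracted path contexts (toy input)
print(code_vector.shape)  # torch.Size([128])
```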
CodeBERT, 2020
CodeBERT, developed by the Microsoft Research team, represents a significant advancement in the realm of Large Language Models (LLMs) for code by introducing multimodal pre-training, combining Natural Language and Programming Language (NL + PL) on the Transformer-based BERT model. The model is trained on a diverse dataset comprising both bimodal data point pairs and unimodal data points, using Masked Language Modeling (MLM) and Replaced Token Detection (RTD) tasks. CodeBERT demonstrated exceptional performance in a variety of domains, excelling notably in natural language code search and code-to-documentation generation.
Fig-3: CodeBERT model pre-training using the Replaced Token Detection (RTD) task. A natural language generator and a code generator replace tokens with alternative tokens, and the CodeBERT model is trained to classify each token as replaced or original. Image from Feng et al., CodeBERT.
Training Dataset:
CodeSearchNet dataset: 2.1M bimodal data points (NL + PL) and 6.4M unimodal data points across 6 languages (Python, Java, JavaScript, PHP, Ruby, Go)
Parameter Size: 125M
Model Architecture: RoBERTa-base
Novelty:
Bimodal Training: CodeBERT introduces an innovative training approach that encompasses both Natural Language and Programming Language tokens. This bimodal training technique enhances the model's ability to understand and generate code by considering the intricate interplay between human-readable descriptions and programming language elements.
Replaced Token Detection (RTD) Task for code: CodeBERT pre-training used Replaced Token Detection (RTD) instead of Next Sentence Prediction (NSP), which showed superior performance.
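The pre-trained checkpoint is publicly available on Hugging Face, so a quick way to see the bimodal setup in action is to feed an NL description and a code snippet together and take a joint embedding. This is a minimal sketch assuming the transformers and torch packages are installed; pooling the first ([CLS]) token is an illustrative convention, not something prescribed by the paper.

```python
# A minimal sketch of joint NL + PL embedding with the released
# microsoft/codebert-base checkpoint.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

nl = "return the maximum of two numbers"
code = "def max_of_two(a, b): return a if a > b else b"

# CodeBERT is pre-trained on concatenated NL + PL pairs, so both segments are passed together
inputs = tokenizer(nl, code, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Use the first-token hidden state as a joint NL-code representation (illustrative choice)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)  # torch.Size([1, 768])
```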
Codex, 2021
Codex was one of the first successful Code LLMs to generate code from doc-strings or natural language prompts with high accuracy, and the predecessor of the widely used GitHub Copilot. Developed by the OpenAI team, Codex uses the GPT-3 architecture and tokenizer, and is pre-trained on a large corpus of GitHub code. This large language model has 12B parameters and was a state-of-the-art model in 2021, showing the best performance on the HumanEval dataset by solving 28.8% of the problems on the first pass.
Further fine-tuning of the model on standalone Python functions (rather than whole code files, which include configs, class implementations, etc.) showed significant improvement, enabling it to solve 37.7% of the HumanEval problems.
Fig-4: A decoder-only Transformer architecture used for the Codex GPT model. Image inspired by the original Transformer paper by Vaswani et al.
Training Dataset: 159 GB of Python files from 54M GitHub repositories.
Parameter Size: 12B (Codex- 12B)
Model Architecture: GPT-3
Novelty:
One of the first successful models to excel at code-writing from natural language prompts, trained by continuing training of GPT-3 models on a large corpus of GitHub repositories.
The authors also created a new dataset, "HumanEval", to benchmark models for code-generation tasks. This dataset consists of 164 hand-written programming problems with unit tests.
Try the Codex model at the OpenAI Playground here.
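The HumanEval benchmark reports pass@k, the probability that at least one of k generated samples passes the unit tests. The Codex paper introduces a numerically stable, unbiased estimator for it, sketched below; the n, c, and k values in the usage line are made-up illustrative numbers.

```python
# A small sketch of the unbiased pass@k estimator from the Codex paper,
# used for HumanEval numbers such as 28.8% at pass@1.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: total samples generated, c: samples passing the unit tests, k: evaluation budget."""
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 200 samples per problem, 60 of them passing, evaluated at k = 1
print(round(pass_at_k(n=200, c=60, k=1), 3))  # 0.3
```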
CodeT5, 2021
CodeT5 is an encoder-decoder model based on the T5 architecture, distinct from both CodeBERT (encoder-only) and Codex (decoder-only). It introduces a unique identifier-aware denoising pre-training task, which helps the model distinguish and recover identifiers in code, enhancing its understanding of code structure.
CodeT5 excels in various tasks such as Code Defect Detection, Clone Detection, Code Translation, and Refinement through multi-task learning, requiring less data for quicker fine-tuning. However, it is evaluated with CodeBLEU scores rather than benchmarked against the HumanEval dataset.
Fig-5: Illustration of how CodeT5 supports various code understanding and generation tasks. Image from the paper by Wang et al., CodeT5.
Training Dataset: CodeSearchNet dataset (same as CodeBERT)
Parameter Size: 220M
Model Architecture: T5 (Encoder-Decoder Architecture)
Novelty:
Encoder-Decoder Model: One of the first encoder-decoder Code LLMs to support both code-understanding and code-generation tasks.
Identifier-aware denoising: Proposes a novel pre-training objective that learns token-type information and the structure of the code. This approach trains the model to differentiate identifiers (variable names, function names) from PL keywords (like if, while, etc.), and to recover them when they are masked.
Multi-Task Learning in the fine-tuning stage: Fine-tunes on various code-related tasks simultaneously, such as Code Defect Detection, Clone Detection, Code Translation, Refinement, etc.
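The released CodeT5 checkpoint can be exercised with a masked-span prompt, which is close in spirit to the denoising objective described above. This is a minimal sketch assuming the transformers package is installed and the Salesforce/codet5-base checkpoint is used; the snippet and the generation length are arbitrary choices.

```python
# A minimal sketch of span denoising with the public Salesforce/codet5-base
# checkpoint: the <extra_id_0> sentinel marks a masked span for the
# encoder-decoder model to recover.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Ask the decoder to fill in the masked span
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```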
PLBart, 2021
PLBART (Program and Language BART) leverages the BART model architecture to automate a range of software engineering tasks, encompassing code summarization, generation, and translation under the umbrella of PLUG (Program and Language Understanding and Generation).
It introduces a denoising sequence-to-sequence modeling approach for enhanced program and language understanding, combining the strengths of BERT and GPT models: a bidirectional encoder is paired with an autoregressive decoder, allowing for a more comprehensive grasp of context and a versatile generation process. The model employs three denoising strategies (token masking, token deletion, and token infilling) to train and fine-tune its capabilities effectively.
Fig-6: Illustration of the BART architecture (also used in PLBART), which has a bidirectional encoder and an autoregressive decoder. Image from the original BART paper by Lewis et al.
Training Dataset: 2M Java and Python functions and their natural language descriptions collected from GitHub and Stack Overflow.
Parameter Size: 140M (6 encoder layers + 6 decoder layers + an additional norm layer on the encoder and decoder)
Model Architecture: BART
Novelty:
Denoising Auto-encoder Approach: Employs a denoising auto-encoder approach that enhances code understanding and generation by effectively utilizing the bidirectional encoder and the auto-regressive decoder, combining the strengths of BERT and GPT models.
Diverse Noising Strategies: Proposes multiple denoising strategies, such as token masking, token deletion, and token infilling. This diversity in noising techniques enhances the model's robustness and effectiveness in learning from noisy data, contributing to improved code understanding and generation (a toy illustration of the three strategies follows below).
Note that not all models use the same benchmark for evaluating performance: the PLBART authors do not evaluate on HumanEval, the dataset used by the majority of other models for benchmarking.
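To illustrate what the three noising strategies do to an input sequence, here is a toy Python sketch applied to a whitespace-split token list. The noise ratios and the fixed span length are illustrative assumptions; the actual model operates on subword sequences with the hyperparameters described in the paper.

```python
# A toy sketch of PLBART's three noising strategies on a plain token list.
import random

MASK = "<mask>"

def token_masking(tokens, p=0.35):
    # Replace randomly chosen tokens with a mask token
    return [MASK if random.random() < p else t for t in tokens]

def token_deletion(tokens, p=0.35):
    # Drop randomly chosen tokens; the model must infer where text is missing
    return [t for t in tokens if random.random() >= p]

def token_infilling(tokens, span=3):
    # Replace a contiguous span with a single mask token
    # (span lengths are sampled in BART/PLBART; fixed here for readability)
    start = random.randrange(0, max(1, len(tokens) - span))
    return tokens[:start] + [MASK] + tokens[start + span:]

tokens = "def add ( a , b ) : return a + b".split()
print(token_masking(tokens))
print(token_deletion(tokens))
print(token_infilling(tokens))
```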
Code Llama, 2023
Code Llama is the latest Code LLM, released by Meta, which beats the existing open-source models on several benchmark datasets. It scores 53% on the HumanEval dataset and 55% on the MBPP dataset (only GPT-4 performs better). These gains can be attributed to a longer context length of 16K (4x that of Llama 2) and training the pre-trained Llama 2 on an extra 500B tokens of program and natural language data.
This model is best suited for code generation and infilling tasks, and can act as a copilot during IDE-based software development. The Code Llama model family has 3 types of models:
Code Llama
Code Llama Python
Code Llama-Instruct
each of them coming in 3 sizes — 7B, 13B and 34B
Fig-7: Code Llama training and fine-tuning pipeline, taking the pre-trained Llama 2 model as input. Image from the original Code Llama paper by Rozière et al.
Training Dataset: 500B tokens of publicly available code, plus an additional 100B tokens for Code Llama Python
Model Architecture: Llama 2
Parameter Size: Available in 3 sizes — 7B, 13B and 34B.
Novelty:
Long Context Fine-Tuning: Proposes a fine-tuning step to handle long sequences, called Long Context Fine-Tuning, which increases the context length to 16,384 (4x the Llama 2 context length of 4,096).
Instruction Fine-Tuning & Self-Instruct: One of the few models that performs instruction fine-tuning, which uses explicit instructions or prompts during the fine-tuning process. Instead of creating human feedback data, which is expensive, the authors propose a novel execution feedback approach to construct a self-instruct dataset.
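Since the weights are openly released on Hugging Face, the infilling capability can be tried directly. The sketch below is a minimal example assuming a recent transformers version with Code Llama support, a GPU with enough memory for the 7B model in float16, and access to the codellama/CodeLlama-7b-hf checkpoint; the function being completed is an arbitrary example.

```python
# A minimal sketch of Code Llama infilling: the <FILL_ME> placeholder marks
# the span the model should generate between the given prefix and suffix.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
output = model.generate(input_ids, max_new_tokens=128)

# Keep only the newly generated tokens and splice them back into the prompt
filling = tokenizer.batch_decode(output[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(prompt.replace("<FILL_ME>", filling))
```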
Conclusion
Andrej Karpathy, one of the founding members of OpenAI, recently called the Transformer the best idea in AI. He added that the Transformer is like a general-purpose differentiable computer that is simultaneously expressive, optimizable, and efficient (X post). As evident from the transformation it has brought in the last 3-4 years, the Transformer model has vast potential to further change the landscape of how we code as software engineers, and I think this is just the beginning.
Follow me for more!
I am a Staff ML Engineer @ LinkedIn. You can follow me on LinkedIn or Twitter. You can reach out to me for a quick chat at Topmate.io.