aniket mishrikotkar

the llama 3 herd of models

I went through the paper released by Meta on the Llama 3 herd of models. It's a 92-page research paper with a lot of practical know-how, including infrastructure and scaling.

Image

TL;DR

Introduction

Llama 3 is a set of foundation language models.

The development of foundation models has two main stages:

  1. A pre-training stage, where the model is trained on the next-word prediction task
  2. A post-training stage, where the model is tuned to improve specific capabilities like coding and reasoning
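To make the next-word prediction objective concrete, here is a minimal sketch of the loss it minimizes: the average negative log-probability the model assigns to the token that actually comes next. The function name and the toy probabilities are made up for illustration; a real model works on logits over a huge vocabulary.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average negative log-likelihood of the correct next token.

    probs_per_step[t] is a (hypothetical) model's probability distribution
    over the vocabulary at position t; target_ids[t] is the index of the
    token that actually comes next.
    """
    nll = [-math.log(probs[target])
           for probs, target in zip(probs_per_step, target_ids)]
    return sum(nll) / len(nll)

# Toy vocabulary of 3 tokens, two prediction steps (made-up numbers).
probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
targets = [0, 2]
loss = next_token_loss(probs, targets)
```

The loss drops as the model puts more probability on the correct next token, which is all that pre-training optimizes.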

There are three key components involved in training foundation models:

Data

They pre-trained Llama 3 on a corpus of 15.6T multilingual tokens (compared to 1.8T tokens for Llama 2).

Scale

They pre-trained the flagship Llama 3 model using 3.8 x 10^25 FLOPs (almost 50x more than the largest Llama 2 model).
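A quick sanity check on that number: a common rule of thumb estimates the training compute of a dense transformer as roughly 6 x parameters x tokens. That rule is my assumption here, not the paper's own accounting. Plugging in the flagship model's 405B parameters (from the paper) and the 15.6T tokens above lands on about the same figure.

```python
# Rule of thumb (an assumption, not the paper's accounting):
# training FLOPs ~= 6 * parameters * tokens for a dense transformer.
params = 405e9    # flagship model size reported in the paper
tokens = 15.6e12  # pre-training tokens quoted above

approx_flops = 6 * params * tokens
print(f"{approx_flops:.2e}")  # prints 3.79e+25
```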

Managing complexity

They opted for a standard dense transformer architecture with some modifications (no mixture-of-experts here). They also adopted a simple post-training process based on supervised fine-tuning (SFT), rejection sampling (RS), and direct preference optimization (DPO).

Performance of fine-tuned Llama 3 models on key benchmark evaluations

Image

Overview

The development of Llama 3 has two main stages:

  1. Pre-training
  2. Post-training

To add image, video, and speech capabilities, they have three additional stages:

Image

Pre-training

Pre-training has:

1. Curation of a training corpus

They used a variety of data sources containing knowledge until the end of 2023. They applied several de-duplication methods and cleaned the data.
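As a rough illustration of the simplest kind of de-duplication, here is a sketch that drops exact duplicates after normalizing case and whitespace. This is my own minimal example, not the paper's pipeline, which is considerably more involved; the function name and sample documents are hypothetical.

```python
import hashlib

def dedupe_documents(docs):
    """Keep the first copy of each document, comparing normalized text by hash."""
    seen = set()
    unique = []
    for doc in docs:
        # Normalize case and whitespace so trivial variants collide.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Something else"]
unique_docs = dedupe_documents(docs)
```

Exact hashing only catches identical text after normalization; near-duplicates need fuzzier techniques.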

Data mix

They talk about combining different data sources in the pre-training data mix. To determine the mix, they used knowledge classification (a classifier that categorizes the types of information in the data) and scaling-law experiments (training several small models on a data mix and using them to predict the performance of a large model on that mix).

Annealing data

Annealing (here, decaying the learning rate to zero at the end of pre-training while training on small amounts of high-quality data such as code and math) can boost the performance of pre-trained models on key benchmarks.

2. Development of model architecture

They used a standard dense transformer architecture. Their performance gains come primarily from high-quality data and increased training scale. Some of the modifications they made to the architecture are:

Image

3. Determining the model size using scaling laws

Scaling laws are used to determine the optimal model size.

To predict downstream benchmark performance, they implemented a two-stage methodology.

Their experiments show that the performance of the flagship model is relatively robust to small changes in the trade-off between model size and training tokens.
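The core mechanic behind a scaling-law fit is simple: assume a power law y = A * x^alpha, fit it on small-scale runs, and extrapolate to the target compute budget. Below is a minimal sketch of that idea using a least-squares fit in log-log space. The data points are synthetic and the function name is mine; this is not the paper's fitting procedure or its numbers.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = A * x**alpha by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line in log-log space is the exponent.
    alpha = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    A = math.exp(my - alpha * mx)
    return A, alpha

# Synthetic (made-up) points: compute budget -> best token count.
budgets = [1e18, 1e19, 1e20, 1e21]
best_tokens = [2.0 * c ** 0.5 for c in budgets]

A, alpha = fit_power_law(budgets, best_tokens)
predicted = A * (1e25) ** alpha  # extrapolate to a much larger budget
```

Extrapolating several orders of magnitude beyond the fitted points is exactly where such fits are fragile, which makes the robustness result above reassuring.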

4. Efficient pre-training at scale

This section was my favourite and I loved reading through it.

They discuss the training infrastructure in amazing detail.

Parallelism for model scaling

This is what I was most excited about and the detail they go into is outstanding. To get a fair amount of understanding, read about it here.

Image

To scale training, they used 4D parallelism, which combines four different types of parallelism: tensor, pipeline, context, and data parallelism.
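One way to picture 4D parallelism is as a four-dimensional grid of GPUs, where every GPU has one coordinate per parallelism type and the group sizes multiply to the total GPU count. The sketch below maps a global rank to such a coordinate. The ordering (tensor parallelism innermost, data parallelism outermost) and the tiny group sizes are assumptions for illustration, and `rank_to_coord` is a hypothetical helper, not an API from any framework.

```python
from typing import NamedTuple

class Coord(NamedTuple):
    tp: int  # tensor-parallel index
    cp: int  # context-parallel index
    pp: int  # pipeline-parallel index
    dp: int  # data-parallel index

def rank_to_coord(rank, tp_size, cp_size, pp_size, dp_size):
    """Map a global GPU rank to its position in a TP x CP x PP x DP grid.

    Assumes TP varies fastest (innermost) and DP slowest (outermost).
    """
    world = tp_size * cp_size * pp_size * dp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} outside world size {world}")
    tp = rank % tp_size
    cp = (rank // tp_size) % cp_size
    pp = (rank // (tp_size * cp_size)) % pp_size
    dp = rank // (tp_size * cp_size * pp_size)
    return Coord(tp, cp, pp, dp)

# Hypothetical small cluster: 2 * 2 * 2 * 2 = 16 GPUs.
coord = rank_to_coord(13, 2, 2, 2, 2)
```

Putting the most communication-heavy dimension innermost keeps those GPUs physically close, which is the usual motivation for this kind of ordering.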

Some of the challenges they encountered:

I want to go in-depth on this but some other time, perhaps in a different post.

Development of pre-training recipe

The training recipe includes:

Post-training

They align the models by applying several rounds of post-training, that is, aligning the model with human feedback on top of a pre-trained checkpoint. Each round of post-training involves supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected either via human annotations or generated synthetically.

Image

They first train a reward model on top of the pre-trained checkpoint using human-annotated preference data. They then fine-tune pre-trained checkpoints with supervised fine-tuning (SFT) and further align them with Direct Preference Optimization (DPO).
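For intuition on the DPO step, here is a minimal sketch of the standard DPO loss for a single preference pair. It takes the summed log-probabilities of the chosen and rejected responses under the model being trained (the policy) and under a frozen reference model. This is the textbook formulation, not Meta's training code, and the `beta` default is an assumed value for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sequence log-probs as inputs)."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy shifted toward the chosen response: loss goes down.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference model.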

Capabilities

They discuss the capabilities of these models in detail, such as how they were trained and what data was used. I haven't covered them here because the concepts are very specific to each capability, but they're worth the read.

Conclusion

This is arguably one of the best write-ups I've ever read. They not only go into detail about the approaches they took and the challenges they faced, but also try to help us understand why they made each decision.

References

[1] They released a blog post.

[2] You can play with Llama 3.1 here.

[3] Meta is committed to open-source AI development. Read Mark's letter.