
Paper breakdown

Training Compute-Optimal Large Language Models

Jordan Hoffmann et al. · 2022 · NeurIPS 2022

Refits the language-model scaling laws across 400+ runs at fixed compute and finds that parameters and tokens should scale roughly equally, not with exponents 0.73 vs. 0.27 as Kaplan et al. claimed. The 70B Chinchilla beats the 280B Gopher on the same FLOP budget. The corrected ratio (~20 training tokens per parameter) became the Chinchilla rule for compute-optimal training.

Overview

Hoffmann et al. (2022) re-ran the language-model scaling experiment that Kaplan et al. (2020) had used to plan GPT-3 and arrived at a different optimum. Training over 400 transformer language models from 70M to 16B parameters at fixed compute budgets, they find that the loss-minimising allocation of compute is roughly $N \propto C^{0.5}$ in parameters and $D \propto C^{0.5}$ in tokens, not $N \propto C^{0.73}$ and $D \propto C^{0.27}$ as the 2020 paper had reported.

The empirical headline is the Chinchilla model itself. Trained at the same FLOP budget as DeepMind's earlier 280B Gopher, but with 70B parameters and 1.4T tokens (instead of 280B parameters and 300B tokens), Chinchilla outperforms Gopher on every benchmark in the paper — MMLU, BIG-bench, the closed-book QA suite, the LM Evaluation Harness — by margins large enough that the comparison is unambiguous. The mechanism is that Gopher was undertrained: 300B tokens for 280B parameters is far below the compute-optimal data allocation.

The scaling-law form survives. Cross-entropy loss still descends as a power of compute over five orders of magnitude. What changed are the constants and the optimal $N$-to-$D$ ratio. The "20 tokens per parameter" rule that emerges from the fits is now the operating prescription for pretraining compute allocation across the open-weight ecosystem (Llama, DeepSeek, Qwen, Mistral all use ratios in this neighbourhood for their pretraining runs).

The paper is also a clean example of how methodological details — specifically, the cosine learning-rate schedule — can dominate the conclusions of a scaling study. Kaplan et al. used a fixed schedule anchored to their largest run; Chinchilla retunes the schedule per-run. That alone moves the optimum.

Mathematical Contributions

Three estimation approaches

The paper estimates the compute-optimal frontier three ways and reports that all three agree:

Approach 1 — fix model size, vary tokens. Train each fixed model size on token budgets spanning roughly 100×, plot loss against tokens for each size, and fit the envelope of minimal loss.

Approach 2 — fix compute, vary model size. For each compute budget $C \in \{6 \times 10^{18}, \ldots, 3 \times 10^{21}\}$ FLOPs, train models of various sizes for the implied number of tokens ($D = C/(6N)$, from the standard transformer FLOP count) and read off the size with lowest loss.

Approach 3 — parametric fit. Fit the joint loss surface

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

over all 400+ runs, then minimise over $(N, D)$ subject to $C = 6ND$.

The fitted exponents (Section 3.3, Equation 10) are $\alpha \approx 0.339$ and $\beta \approx 0.285$, with $E$ representing the irreducible loss of the data distribution. The compute-optimal allocation is

$$N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a = \frac{\beta}{\alpha + \beta} \approx 0.46, \qquad b = \frac{\alpha}{\alpha + \beta} \approx 0.54.$$

The headline reading "$N$ and $D$ scale equally" is the rounding of these to $a \approx b \approx 0.5$. All three approaches give exponents within ~0.04 of each other.
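
Substituting $D = C/(6N)$ into $L(N, D)$ and setting the derivative to zero gives the allocation in closed form. A minimal Python sketch, using the constants reported in the paper (Equation 10: $E = 1.69$, $A = 406.4$, $B = 410.7$); this is an illustration, not the paper's fitting code:

```python
# A sketch of the Approach 3 optimum. Constants are the paper's reported
# fits (Eq. 10); E is the irreducible loss and does not affect the argmin.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.339, 0.285

a = beta / (alpha + beta)   # N_opt ∝ C^a, ≈ 0.46
b = alpha / (alpha + beta)  # D_opt ∝ C^b, ≈ 0.54
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

def optimal_allocation(C):
    """Minimise L(N, D) = E + A/N**alpha + B/D**beta subject to C = 6*N*D."""
    return G * (C / 6) ** a, (1 / G) * (C / 6) ** b  # (N_opt, D_opt)

N_opt, D_opt = optimal_allocation(5.76e23)  # the Gopher/Chinchilla budget
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"N_opt ≈ {N_opt / 1e9:.0f}B params, D_opt ≈ {D_opt / 1e12:.1f}T tokens")
# Approach 3's prefactors put this optimum near 40B parameters, somewhat
# below the ~67-70B that Approaches 1 and 2 read off at the same budget;
# the replication note cited under Further Reading discusses this fit.
```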

The 20-tokens-per-parameter rule

The compute-optimal ratio $D_{\mathrm{opt}} / N_{\mathrm{opt}}$ at the FLOP budgets in the paper sits in the range 17–22. The paper's Table 3 reads off compute-optimal $(N, D)$ pairs at fixed $C$ values:

| Compute (FLOPs) | $N_{\mathrm{opt}}$ | $D_{\mathrm{opt}}$ | $D/N$ |
| --- | --- | --- | --- |
| $1.92 \times 10^{19}$ | 400M | 8.0B | 20 |
| $1.21 \times 10^{20}$ | 1B | 20.2B | 20 |
| $5.76 \times 10^{23}$ | 70B | 1.4T | 20 |
| $3.85 \times 10^{24}$ | 175B | 3.7T | 21 |
| $9.90 \times 10^{24}$ | 280B | 5.9T | 21 |

For comparison, Kaplan et al.'s prescription gives a $D/N$ that shrinks with scale; at $C \approx 5 \times 10^{23}$ FLOPs the implied $D/N$ from their exponents is around 1–2, an order of magnitude below the compute optimum the Chinchilla fits identify.
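
Under the standard $C \approx 6ND$ cost model, fixing the ratio $D/N = 20$ pins down the model size directly, since $C = 120N^2$. A minimal sketch of the rule of thumb in use (budget figure from Table 3 above):

```python
# Rule-of-thumb sizing under C ≈ 6*N*D with a fixed tokens-per-parameter
# ratio r: C = 6 * r * N**2, so N = sqrt(C / (6 * r)).
def chinchilla_rule(C, tokens_per_param=20):
    N = (C / (6 * tokens_per_param)) ** 0.5
    return N, tokens_per_param * N  # (params, tokens)

N, D = chinchilla_rule(5.76e23)  # the Gopher/Chinchilla FLOP budget
print(f"N ≈ {N / 1e9:.0f}B params, D ≈ {D / 1e12:.2f}T tokens")
# -> N ≈ 69B, D ≈ 1.39T: essentially Chinchilla's 70B / 1.4T recipe.
```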

The empirical Chinchilla vs. Gopher comparison

The paper's most-cited result (Table 5) compares a single 70B-parameter model trained on 1.4T tokens (the predicted compute optimum for the Gopher FLOP budget) against the actual 280B Gopher. Chinchilla wins on:

  • MMLU 5-shot: 67.6% vs. 60.0%
  • The LAMBADA language-modelling task
  • All six closed-book QA tasks (NaturalQS, TriviaQA, etc.)
  • 50 of 57 BIG-bench tasks where the comparison is meaningful

Both models cost the same in training FLOPs ($\approx 5.76 \times 10^{23}$). The 4× reduction in parameter count makes Chinchilla substantially cheaper to serve, on top of being more capable.
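
As a sanity check on the equal-budget claim, a sketch using the $C \approx 6ND$ shorthand; the paper's own FLOP accounting is more detailed, which is how both models land on the same $5.76 \times 10^{23}$ figure:

```python
# Rough training cost under the C ≈ 6*N*D approximation.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

print(f"Gopher:     {train_flops(280e9, 300e9):.2e} FLOPs")  # ≈ 5.0e23
print(f"Chinchilla: {train_flops(70e9, 1.4e12):.2e} FLOPs")  # ≈ 5.9e23
# Both sit within ~15% of the paper's exact 5.76e23 figure, which also
# counts terms (e.g. attention) that the 6*N*D shorthand drops.
```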

What changed methodologically vs. Kaplan

Kaplan et al. trained models with a fixed cosine learning-rate schedule whose total length was set to the longest run in their sweep. For shorter runs, the schedule decays too slowly relative to the run length, leaving those runs effectively under-converged when their loss is measured. Chinchilla retunes the cosine length to match the actual training-token budget of each run.

The paper documents this in Section 4 and shows it accounts for most of the difference. When Kaplan-style schedules are reapplied to Chinchilla's data, the parameter exponent inflates and the data exponent collapses, recovering Kaplan's allocation. The "scaling laws" themselves were not wrong; they were measured under a learning-rate schedule that biased small-model loss upward.
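
A small sketch of the confound (the learning-rate values are illustrative, not the paper's):

```python
import math

# Cosine decay from lr_max to lr_min over a horizon of T_max steps.
def cosine_lr(step, T_max, lr_max=3e-4, lr_min=3e-5):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / T_max))

# A 10k-step run under a schedule sized for 100k steps (Kaplan-style
# anchoring) stops while the learning rate is still near its peak ...
print(f"mismatched: {cosine_lr(10_000, T_max=100_000):.2e}")  # ~2.9e-4
# ... while a schedule matched to the run (Chinchilla-style) has fully
# decayed, so the measured final loss is much closer to converged.
print(f"matched:    {cosine_lr(10_000, T_max=10_000):.2e}")   # 3.0e-5
```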

What stayed the same

The functional form. Power laws in $N$, $D$, and $C$ over five-plus orders of magnitude continue to hold. The architectural insensitivity (Section 4 of Kaplan: depth/width/head-count don't matter at fixed $N$) was not re-examined here, but later work (Hoffmann's own follow-ons, the OPT and BLOOM scaling reports) confirmed it. The cross-entropy loss as the metric of interest stayed the same — emergent-capability arguments came later.

What It Gets Right

The first thing is the experimental design. Three independent estimation methods (varying tokens, varying model size, fitting a joint surface) give cross-validating results within reasonable error bars. Each has its own systematic biases, and the agreement is the strongest evidence the paper has that its allocation is right.

The second is the explicit mechanism for the disagreement with Kaplan et al. Section 4 does not just claim "we got different numbers"; it identifies the cosine-schedule confound, retraces it, and shows that controlling for it reproduces Kaplan's exponents. This is the kind of post-hoc explanatory work that converts a follow-up paper from "results differ" to "we know why".

The third is the actionable headline. "Train ~20 tokens per parameter at this compute scale" is concrete enough that practitioners did act on it. From mid-2022 onward the public ecosystem mostly trained at Chinchilla-optimal ratios; the under-training that Gopher exemplified became rare.

Common Misconceptions

The 20-tokens-per-parameter rule is for training compute, not inference. Once a model is trained, you may want to over-train (use $D/N \gg 20$) to get a smaller model that is cheaper to serve at the same loss. Llama-3 8B, trained on ~15T tokens, has $D/N \approx 1900$ — very far from compute-optimal, on purpose, because the inference-cost calculus differs from the training-cost calculus. Sardana et al. (2024) and the Llama 3 report (Dubey et al., 2024) make this explicit.
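
A quick check of the ratios quoted above:

```python
# Tokens-per-parameter for a compute-optimal vs. a deliberately
# over-trained model (parameter and token counts as quoted above).
models = {
    "Chinchilla": (70e9, 1.4e12),  # 70B params, 1.4T tokens
    "Llama-3 8B": (8e9, 15e12),    # 8B params, ~15T tokens
}
for name, (n_params, n_tokens) in models.items():
    print(f"{name}: D/N ≈ {n_tokens / n_params:,.0f}")
# Chinchilla: D/N ≈ 20. Llama-3 8B: D/N ≈ 1,875, i.e. the ~1900 quoted
# above: far past compute-optimal, traded for cheaper serving.
```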

The Chinchilla numbers are not a universal law. The exponents $\alpha$ and $\beta$ depend on the data mixture, the tokenizer, and the architecture. Different labs that re-fit on their own data get slightly different ratios — Llama-2's allocation differs from DeepSeek-V2's, and both differ from Chinchilla's original numbers. The form of the law is reliable; the constants are not portable across pretraining setups.

Chinchilla does not say bigger models are wasteful. It says that at a fixed training compute budget there is an optimum, and Gopher was not at it. If the budget grows, both $N$ and $D$ should grow; the optimum is a frontier, not a single point. Reporting "Chinchilla-optimal" without specifying the compute budget is meaningless.

The paper also does not claim downstream task accuracy follows the same scaling law as cross-entropy. Wei et al. (2022) and Schaeffer et al. (2023) document tasks where benchmark accuracy moves nonlinearly while loss moves smoothly. Compute-optimal in cross-entropy is not always compute-optimal in capability per FLOP.


Further Reading

  • Besiroglu, T. et al. (2024). "Chinchilla Scaling: A replication attempt." arXiv:2404.10102. Flags a numerical discrepancy in Approach 3's reported confidence intervals; the headline allocation survives, but the original error bars were too tight.
  • Sardana, N., & Frankle, J. (2024). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448. Adds inference compute to the optimisation, justifies over-training relative to Chinchilla when expected inference volume is large.
  • Sorscher, B. et al. (2022). "Beyond neural scaling laws: beating power law scaling via data pruning." NeurIPS. arXiv:2206.14486. Argues data quality, not just data quantity, can break the standard power-law form.
  • Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Llama-2 trains for far more tokens per parameter than Chinchilla-optimal, in service of inference cost — the practical breakpoint.

References

Canonical:

  • Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556.

Direct precursor and the paper it corrects:

  • Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. The OpenAI scaling-laws paper whose constants Chinchilla revises.

Direct application:

  • Rae, J. W. et al. (2021). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher." arXiv:2112.11446. The 280B Gopher model that Chinchilla's recipe outperforms at equal compute.

Critical refinements:

  • Besiroglu, T. et al. (2024). "Chinchilla Scaling: A replication attempt." arXiv:2404.10102. Replication note correcting confidence intervals in the original Approach 3 fit.
  • Sardana, N., & Frankle, J. (2024). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448.
  • Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682. Why cross-entropy power laws don't always translate to benchmark scaling.

Follow-on work building on the Chinchilla prescription:

  • Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. The first Llama; explicitly Chinchilla-aware in its data allocation.
  • Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
  • Dubey, A. et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Departs from Chinchilla-optimal training in favour of inference-cost-aware overtraining.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12.7 — scaling laws.


Last reviewed: May 7, 2026