
Paper breakdown

Training Compute-Optimal Large Language Models

Jordan Hoffmann et al. · 2022 · NeurIPS 2022

Refits the language-model scaling laws across 400+ runs at fixed compute and finds that parameters and tokens should scale roughly equally, not with exponents 0.73 vs. 0.27 as Kaplan et al. claimed. The 70B Chinchilla beats the 280B Gopher on the same FLOP budget. The corrected ratio (~20 training tokens per parameter) became the Chinchilla rule for compute-optimal training.

Overview

Hoffmann et al. (2022) re-ran the language-model scaling experiment that Kaplan et al. (2020) had used to plan GPT-3 and arrived at a different optimum. Training over 400 transformer language models from 70M to 16B parameters at fixed compute budgets, they find that the loss-minimising allocation of compute is roughly $N \propto C^{0.5}$ in parameters and $D \propto C^{0.5}$ in tokens, not $N \propto C^{0.73}$ and $D \propto C^{0.27}$ as the 2020 paper had reported.

The empirical headline is the Chinchilla model itself. Trained at the same FLOP budget as DeepMind's earlier 280B Gopher, but with 70B parameters and 1.4T tokens (instead of 280B parameters and 300B tokens), Chinchilla outperforms Gopher on every benchmark in the paper — MMLU, BIG-bench, the closed-book QA suite, the LM Evaluation Harness — by margins large enough that the comparison is unambiguous. The mechanism is that Gopher was undertrained: 300B tokens for 280B parameters is far below the compute-optimal data allocation.

The scaling-law form survives. Cross-entropy loss still descends as a power of compute over five orders of magnitude. What changed are the constants and the optimal $N$-to-$D$ ratio. The "20 tokens per parameter" rule that emerges from the fits is now the operating prescription for pretraining compute allocation across the open-weight ecosystem (Llama, DeepSeek, Qwen, Mistral all use ratios in this neighbourhood for their pretraining runs).

The paper is also a clean example of how methodological details — specifically, the cosine learning-rate schedule — can dominate the conclusions of a scaling study. Kaplan et al. used a fixed schedule anchored to their largest run; Chinchilla retunes the schedule per-run. That alone moves the optimum.

Mathematical Contributions

Three estimation approaches

The paper estimates the compute-optimal frontier three ways and reports that all three agree:

Approach 1 — fix model size, vary tokens. Train each fixed model size on token budgets spanning roughly 100×, plot loss against tokens for each size, and fit the envelope of minimal loss.

Approach 2 — fix compute, vary model size. For each compute budget $C \in \{6 \times 10^{18}, \ldots, 3 \times 10^{21}\}$ FLOPs, train models of various sizes for the implied number of tokens ($D = C/(6N)$, from the standard transformer FLOP count) and read off the size with lowest loss.

Approach 3 — parametric fit. Fit the joint loss surface

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

over all 400+ runs, then minimise over $(N, D)$ subject to $C = 6ND$.

The fitted exponents (Section 3.3, Equation 10) are $\alpha \approx 0.339$ and $\beta \approx 0.285$, with $E$ representing the irreducible loss of the data distribution. The compute-optimal allocation is

$$N_{\mathrm{opt}}(C) \propto C^{a}, \qquad D_{\mathrm{opt}}(C) \propto C^{b}, \qquad a = \frac{\beta}{\alpha + \beta} \approx 0.46, \qquad b = \frac{\alpha}{\alpha + \beta} \approx 0.54.$$

The headline reading "$N$ and $D$ scale equally" is the rounding of these to $a \approx b \approx 0.5$. All three approaches give exponents within ~0.04 of each other.
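
Substituting $D = C/(6N)$ into $L(N, D)$ and setting the derivative to zero gives the allocation in closed form. A minimal Python sketch, using the constants reported in the paper (Equation 10: $E = 1.69$, $A = 406.4$, $B = 410.7$); this is an illustration, not the paper's fitting code:

```python
# A sketch of the Approach 3 optimum. Constants are the paper's reported
# fits (Eq. 10); E is the irreducible loss and does not affect the argmin.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.339, 0.285

a = beta / (alpha + beta)   # N_opt ∝ C^a, ≈ 0.46
b = alpha / (alpha + beta)  # D_opt ∝ C^b, ≈ 0.54
G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))

def optimal_allocation(C):
    """Minimise L(N, D) = E + A/N**alpha + B/D**beta subject to C = 6*N*D."""
    return G * (C / 6) ** a, (1 / G) * (C / 6) ** b  # (N_opt, D_opt)

N_opt, D_opt = optimal_allocation(5.76e23)  # the Gopher/Chinchilla budget
print(f"a = {a:.2f}, b = {b:.2f}")
print(f"N_opt ≈ {N_opt / 1e9:.0f}B params, D_opt ≈ {D_opt / 1e12:.1f}T tokens")
# Approach 3's prefactors put this optimum near 40B parameters, somewhat
# below the ~67-70B that Approaches 1 and 2 read off at the same budget;
# the replication note cited under Further Reading discusses this fit.
```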

The 20-tokens-per-parameter rule

The compute-optimal ratio $D_{\mathrm{opt}} / N_{\mathrm{opt}}$ at the FLOP budgets in the paper sits in the range 17–22. The paper's Table 3 reads off compute-optimal $(N, D)$ pairs at fixed $C$ values:

| Compute (FLOPs) | $N_{\mathrm{opt}}$ | $D_{\mathrm{opt}}$ | $D/N$ |
| --- | --- | --- | --- |
| $1.92 \times 10^{19}$ | 400M | 8.0B | 20 |
| $1.21 \times 10^{20}$ | 1B | 20.2B | 20 |
| $5.76 \times 10^{23}$ | 70B | 1.4T | 20 |
| $3.85 \times 10^{24}$ | 175B | 3.7T | 21 |
| $9.90 \times 10^{24}$ | 280B | 5.9T | 21 |

For comparison, Kaplan et al.'s prescription gives a $D/N$ that shrinks with scale; at $C \approx 5 \times 10^{23}$ FLOPs the implied $D/N$ from their exponents is around 1–2, an order of magnitude below the compute optimum the Chinchilla fits identify.
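
Under the standard $C \approx 6ND$ cost model, fixing the ratio $D/N = 20$ pins down the model size directly, since $C = 120N^2$. A minimal sketch of the rule of thumb in use (budget figure from Table 3 above):

```python
# Rule-of-thumb sizing under C ≈ 6*N*D with a fixed tokens-per-parameter
# ratio r: C = 6 * r * N**2, so N = sqrt(C / (6 * r)).
def chinchilla_rule(C, tokens_per_param=20):
    N = (C / (6 * tokens_per_param)) ** 0.5
    return N, tokens_per_param * N  # (params, tokens)

N, D = chinchilla_rule(5.76e23)  # the Gopher/Chinchilla FLOP budget
print(f"N ≈ {N / 1e9:.0f}B params, D ≈ {D / 1e12:.2f}T tokens")
# -> N ≈ 69B, D ≈ 1.39T: essentially Chinchilla's 70B / 1.4T recipe.
```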

The empirical Chinchilla vs. Gopher comparison

The paper's most-cited result (Table 5) compares a single 70B-parameter model trained on 1.4T tokens (the predicted compute optimum for the Gopher FLOP budget) against the actual 280B Gopher. Chinchilla wins on:

  • MMLU 5-shot: 67.6% vs. 60.0%
  • The LAMBADA language-modelling task
  • All six closed-book QA tasks (NaturalQS, TriviaQA, etc.)
  • 50 of 57 BIG-bench tasks where the comparison is meaningful

Both models cost the same in training FLOPs ($\approx 5.76 \times 10^{23}$). The 4× reduction in parameter count makes Chinchilla substantially cheaper to serve, on top of being more capable.
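
As a sanity check on the equal-budget claim, a sketch using the $C \approx 6ND$ shorthand; the paper's own FLOP accounting is more detailed, which is how both models land on the same $5.76 \times 10^{23}$ figure:

```python
# Rough training cost under the C ≈ 6*N*D approximation.
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

print(f"Gopher:     {train_flops(280e9, 300e9):.2e} FLOPs")  # ≈ 5.0e23
print(f"Chinchilla: {train_flops(70e9, 1.4e12):.2e} FLOPs")  # ≈ 5.9e23
# Both sit within ~15% of the paper's exact 5.76e23 figure, which also
# counts terms (e.g. attention) that the 6*N*D shorthand drops.
```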

What changed methodologically vs. Kaplan

Kaplan et al. trained models with a fixed cosine learning-rate schedule whose total length was set to the longest run in their sweep. For shorter runs, the schedule decays too slowly relative to the run length, leaving those runs effectively under-converged when their loss is measured. Chinchilla retunes the cosine length to match the actual training-token budget of each run.

The paper documents this in Section 4 and shows it accounts for most of the difference. When Kaplan-style schedules are reapplied to Chinchilla's data, the parameter exponent inflates and the data exponent collapses, recovering Kaplan's allocation. The "scaling laws" themselves were not wrong; they were measured under a learning-rate schedule that biased small-model loss upward.
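
A small sketch of the confound (the learning-rate values are illustrative, not the paper's):

```python
import math

# Cosine decay from lr_max to lr_min over a horizon of T_max steps.
def cosine_lr(step, T_max, lr_max=3e-4, lr_min=3e-5):
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / T_max))

# A 10k-step run under a schedule sized for 100k steps (Kaplan-style
# anchoring) stops while the learning rate is still near its peak ...
print(f"mismatched: {cosine_lr(10_000, T_max=100_000):.2e}")  # ~2.9e-4
# ... while a schedule matched to the run (Chinchilla-style) has fully
# decayed, so the measured final loss is much closer to converged.
print(f"matched:    {cosine_lr(10_000, T_max=10_000):.2e}")   # 3.0e-5
```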

What stayed the same

The functional form. Power laws in $N$, $D$, and $C$ over five-plus orders of magnitude continue to hold. The architectural insensitivity (Section 4 of Kaplan: depth/width/head-count don't matter at fixed $N$) was not re-examined here, but later work (Hoffmann's own follow-ons, the OPT and BLOOM scaling reports) confirmed it. The cross-entropy loss as the metric of interest stayed the same — emergent-capability arguments came later.

What It Gets Right

The first thing is the experimental design. Three independent estimation methods (varying tokens, varying model size, fitting a joint surface) give cross-validating results within reasonable error bars. Each has its own systematic biases, and the agreement is the strongest evidence the paper has that its allocation is right.

The second is the explicit mechanism for the disagreement with Kaplan et al. Section 4 does not just claim "we got different numbers"; it identifies the cosine-schedule confound, retraces it, and shows that controlling for it reproduces Kaplan's exponents. This is the kind of post-hoc explanatory work that converts a follow-up paper from "results differ" to "we know why".

The third is the actionable headline. "Train ~20 tokens per parameter at this compute scale" is concrete enough that practitioners did act on it. From mid-2022 onward the public ecosystem mostly trained at Chinchilla-optimal ratios; the under-training that Gopher exemplified became rare.

Common Misconceptions

The 20-tokens-per-parameter rule is for training compute, not inference. Once a model is trained, you may want to over-train (use $D/N \gg 20$) to get a smaller model that is cheaper to serve at the same loss. Llama-3 8B, trained on ~15T tokens, has $D/N \approx 1900$ — very far from compute-optimal, on purpose, because the inference-cost calculus differs from the training-cost calculus. Sardana et al. (2024) and the Llama 3 report (Dubey et al., 2024) make this explicit.
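
A quick check of the ratios quoted above:

```python
# Tokens-per-parameter for a compute-optimal vs. a deliberately
# over-trained model (parameter and token counts as quoted above).
models = {
    "Chinchilla": (70e9, 1.4e12),  # 70B params, 1.4T tokens
    "Llama-3 8B": (8e9, 15e12),    # 8B params, ~15T tokens
}
for name, (n_params, n_tokens) in models.items():
    print(f"{name}: D/N ≈ {n_tokens / n_params:,.0f}")
# Chinchilla: D/N ≈ 20. Llama-3 8B: D/N ≈ 1,875, i.e. the ~1900 quoted
# above: far past compute-optimal, traded for cheaper serving.
```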

The Chinchilla numbers are not a universal law. The exponents $\alpha$ and $\beta$ depend on the data mixture, the tokenizer, and the architecture. Different labs that re-fit on their own data get slightly different ratios — Llama-2's allocation differs from DeepSeek-V2's, and both differ from Chinchilla's original numbers. The form of the law is reliable; the constants are not portable across pretraining setups.

Chinchilla does not say bigger models are wasteful. It says that at a fixed training compute budget there is an optimum, and Gopher was not at it. If the budget grows, both $N$ and $D$ should grow; the optimum is a frontier, not a single point. Reporting "Chinchilla-optimal" without specifying the compute budget is meaningless.

The paper also does not claim downstream task accuracy follows the same scaling law as cross-entropy. Wei et al. (2022) and Schaeffer et al. (2023) document tasks where benchmark accuracy moves nonlinearly while loss moves smoothly. Compute-optimal in cross-entropy is not always compute-optimal in capability per FLOP.


Further Reading

  • Besiroglu, T. et al. (2024). "Chinchilla Scaling: A replication attempt." arXiv:2404.10102. Flags a numerical discrepancy in Approach 3's reported confidence intervals; the headline allocation survives, but the original error bars were too tight.
  • Sardana, N., & Frankle, J. (2024). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448. Adds inference compute to the optimisation, justifies over-training relative to Chinchilla when expected inference volume is large.
  • Sorscher, B. et al. (2022). "Beyond neural scaling laws: beating power law scaling via data pruning." NeurIPS. arXiv:2206.14486. Argues data quality, not just data quantity, can break the standard power-law form.
  • Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288. Llama-2 trains for far more tokens per parameter than Chinchilla-optimal, in service of inference cost — the practical breakpoint.

References

Canonical:

  • Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556.

Direct precursor and the paper it corrects:

  • Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361. The OpenAI scaling-laws paper whose constants Chinchilla revises.

Direct application:

  • Rae, J. W. et al. (2021). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher." arXiv:2112.11446. The 280B Gopher model that Chinchilla's recipe outperforms at equal compute.

Critical refinements:

  • Besiroglu, T. et al. (2024). "Chinchilla Scaling: A replication attempt." arXiv:2404.10102. Replication note correcting confidence intervals in the original Approach 3 fit.
  • Sardana, N., & Frankle, J. (2024). "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." arXiv:2401.00448.
  • Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682. Why cross-entropy power laws don't always translate to benchmark scaling.

Follow-on work building on the Chinchilla prescription:

  • Touvron, H. et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. The first Llama; explicitly Chinchilla-aware in its data allocation.
  • Touvron, H. et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
  • Dubey, A. et al. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. Departs from Chinchilla-optimal training in favour of inference-cost-aware overtraining.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12.7 — scaling laws.


Last reviewed: May 7, 2026