
Paper breakdown

Scaling Laws for Neural Language Models

Jared Kaplan et al. · 2020 · arXiv preprint

Empirical power-law fits relating cross-entropy test loss to non-embedding parameter count, dataset size, and compute. The fitted constants and the compute-optimal training prescription guided the planning of GPT-3 and served as the operating manual for foundation models for two years.

Overview

Kaplan and collaborators (2020) trained over 200 transformer language models spanning seven orders of magnitude in compute, four orders of magnitude in non-embedding parameter count, and three orders of magnitude in tokens. They reported clean power-law fits relating cross-entropy test loss to each of those axes, with deviations only at the smallest sizes and at the compute frontier where data ran out.

The paper's prescription is concrete. Given a fixed compute budget $C$, allocate it between model size $N$ and training tokens $D$ so that loss is minimized; the paper claims $N \propto C^{0.73}$ and $D \propto C^{0.27}$. The dominant share goes to model size; only a small share to data. This was the operating manual for OpenAI through GPT-3 (2020) and for most public LLM efforts through early 2022.

In March 2022, Hoffmann et al. at DeepMind ran a more careful experimental sweep at fixed compute and reported a different optimal allocation: $N$ and $D$ should scale roughly equally. Chinchilla (a 70B model trained on 1.4T tokens) outperformed Gopher (a 280B model trained on 300B tokens) at the same FLOP budget. The Kaplan paper was wrong on the constants. The framework itself, power laws in $N$, $D$, and $C$ with breaks when one of them becomes the bottleneck, survived and is still how foundation-model planning works.

Two takeaways. First, the form of the scaling laws is robust: cross-entropy descends as a power of compute over five orders of magnitude. Second, the constants depend on the experimental protocol — what learning rate schedule, what tokenizer, what context length — and are not portable across labs. Read both papers if you are planning a training run.

Mathematical Contributions

The three power laws

For a transformer with non-embedding parameter count $N$ trained on a dataset of $D$ tokens, with each fit taken in the regime where the other resources are not limiting, the paper fits:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},\qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},\qquad L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}$$

with reported exponents $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$, and constants $N_c \approx 8.8 \times 10^{13}$, $D_c \approx 5.4 \times 10^{13}$, and $C_c \approx 3.1 \times 10^8$ PF-days. $C_{\min}$ is the minimum compute at which a given loss is achievable, which differs from the paper's other compute axis $C$ that includes overheads.

Each fit holds when the corresponding resource is the bottleneck. If you have plenty of data and parameters, compute is the bottleneck and $L \sim C_{\min}^{-\alpha_C}$. The paper shows the three curves are consistent and proposes a joint functional form combining them.
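Plugging the reported constants back in is a useful sanity check on the fits and their units. A minimal sketch in Python, using the values quoted above; the example model size is illustrative:

```python
# Single-variable scaling fits from Kaplan et al. (2020), using the
# constants quoted above. Units: N in non-embedding parameters,
# D in tokens, C_min in PF-days.

ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.050
N_C, D_C, C_C = 8.8e13, 5.4e13, 3.1e8

def loss_from_params(n):       # L(N): data and compute not limiting
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d):       # L(D): model size not limiting
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):  # L(C_min): compute-efficient frontier
    return (C_C / c_min) ** ALPHA_C

# Example: a GPT-2-scale model (~1.5e9 non-embedding parameters)
print(loss_from_params(1.5e9))  # ~2.3 cross-entropy per token under the N-only fit
```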

The joint loss surface

Section 4 fits a single joint function:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

which separates into the $N$-bottleneck and $D$-bottleneck regimes. When $N$ is small, the first term dominates; when $D$ is small, the second. The crossover defines the compute-optimal allocation.
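A short sketch of the joint surface, reusing the constants above; the specific $(N, D)$ values are illustrative, chosen only to sit deep in each bottleneck regime:

```python
# Joint loss surface L(N, D) from the equation above, reusing the
# constants of the single-variable fits.

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss(n, d):
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# N-bottleneck: data effectively unlimited, first term dominates
print(loss(1e8, 1e15))   # ~= (N_C / 1e8) ** ALPHA_N
# D-bottleneck: parameters effectively unlimited, second term dominates
print(loss(1e15, 1e9))   # ~= (D_C / 1e9) ** ALPHA_D
```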

Compute-optimal allocation (Kaplan)

Compute cost scales as $C \approx 6ND$ for transformer training (roughly 6 FLOPs per parameter per token across the forward and backward passes). Minimizing $L(N, D)$ subject to $6ND = C$ with a Lagrange multiplier gives, after the paper's algebra:

$$N_{\text{opt}}(C) \propto C^{0.73},\qquad D_{\text{opt}}(C) \propto C^{0.27}$$

The exponents come from $\alpha_N$ and $\alpha_D$. The implication: 10× the compute, and parameter count grows ~5.4× while tokens grow only ~1.85×. This is what justified the GPT-3 training recipe of 175B parameters trained on 300B tokens.
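A worked example of the allocation rule. The compute identity $C \approx 6ND$ and the GPT-3 anchor point (175B parameters, 300B tokens) come from this section; the proportionality constants below are back-solved from that anchor for illustration and are not the paper's fitted values:

```python
# Kaplan-style compute-optimal allocation: N ∝ C^0.73, D ∝ C^0.27,
# with training compute C ≈ 6 * N * D (FLOPs).
# Anchor point: the GPT-3 recipe mentioned above (175B params, 300B tokens).
# The proportionality constants are back-solved from it for illustration.

N0, D0 = 1.75e11, 3.0e11   # anchor: parameters, tokens
C0 = 6 * N0 * D0           # anchor compute in FLOPs

def kaplan_allocation(c_flops):
    """Split a FLOP budget into (params, tokens) using the Kaplan exponents."""
    n = N0 * (c_flops / C0) ** 0.73
    d = D0 * (c_flops / C0) ** 0.27
    return n, d

# 10x the compute: parameters grow ~5.4x, tokens only ~1.9x
n, d = kaplan_allocation(10 * C0)
print(f"{n / N0:.2f}x params, {d / D0:.2f}x tokens")  # 5.37x params, 1.86x tokens
```

Because the two exponents sum to 1, the split stays consistent with the constraint: $6\,N_{\text{opt}}D_{\text{opt}}$ scales exactly as $C$.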

Why this was wrong on data

Hoffmann et al. (2022) re-ran the sweep with more careful hyperparameter tuning, in particular tuning the learning-rate schedule to each run's token budget, where Kaplan et al. had used a fixed schedule length anchored to the largest runs. The Chinchilla fits give:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with $\alpha \approx 0.34$ and $\beta \approx 0.28$, leading to:

$$N_{\text{opt}}(C) \propto C^{0.50},\qquad D_{\text{opt}}(C) \propto C^{0.50}$$

The mechanism for the discrepancy was traced to the learning-rate schedule: training small models with the same long schedule used for large models under-trains them, biasing the parameter exponent upward. Once the schedule was tuned per-run, the data exponent grew and the optimal allocation became roughly equal. See scaling compute-optimal training for the full comparison.
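For comparison with the Kaplan allocation above, the same Lagrange-multiplier argument applied to the Chinchilla fit gives $N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$ (a standard derivation, not quoted from either paper), which with the exponents above lands close to the equal split. A sketch:

```python
# Chinchilla-style allocation: minimize E + A/N**alpha + B/D**beta
# subject to 6*N*D = C. The first-order condition gives
#   N_opt ~ C**(beta/(alpha+beta)),  D_opt ~ C**(alpha/(alpha+beta)).

alpha, beta = 0.34, 0.28
a = beta / (alpha + beta)   # exponent for N_opt(C)
b = alpha / (alpha + beta)  # exponent for D_opt(C)
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")  # N_opt ~ C^0.45, D_opt ~ C^0.55

# Roughly equal exponents mean tokens per parameter stay roughly constant
# along the compute-optimal frontier; the Chinchilla run mentioned above
# (70B params, 1.4T tokens) sits at 20 tokens per parameter.
print(1.4e12 / 70e9)  # 20.0
```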

What the paper got right

The functional form (power laws over five orders of magnitude in compute) was confirmed by every subsequent independent replication. The fact that scaling alone — no architectural change, no algorithmic trick — predictably improves loss became the empirical foundation of foundation-model development. The rough order of magnitude for the constants was good enough to plan GPT-3 successfully, even if the optimal allocation was later corrected.

Other reported regularities

Section 3: model performance is insensitive to architecture details (depth, width, attention-head count) at fixed $N$. The dominant axis is parameter count, not architectural geometry. This justifies treating the transformer as a black-box scaling unit. Section 5: the critical batch size grows as the target loss falls, roughly $B_{\text{crit}} \propto L^{-1/\alpha_B}$ for a separately fitted exponent $\alpha_B$; bigger models and lower target losses support larger batches, which became the rationale for multi-million-token batches.
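A sketch of the batch-size heuristic as stated above; the constants $B^\ast$ and $\alpha_B$ in the code are assumed, order-of-magnitude stand-ins rather than values quoted in this breakdown:

```python
# Critical batch size heuristic: B_crit grows as the target loss falls,
# B_crit(L) ~ B_star / L ** (1 / alpha_B).
# B_star and alpha_B are assumed placeholders (order-of-magnitude only),
# not values quoted in this breakdown.

B_STAR = 2e8     # tokens (assumed)
ALPHA_B = 0.21   # assumed exponent

def critical_batch_tokens(target_loss):
    return B_STAR / target_loss ** (1.0 / ALPHA_B)

# Lower target loss -> much larger useful batch (in tokens)
for L in (4.0, 3.0, 2.5):
    print(L, f"{critical_batch_tokens(L):.2e}")
```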


Why It Matters Now

This paper changed how research is planned. Before scaling laws, "make it bigger" was a vague suggestion; after, it was a quantitative recipe with a slope. Funded training runs went from "we hope scaling works" to "loss at this compute will be approximately this value." The error bars around that prediction are now small enough that some labs publish the predicted vs. actual loss curve as a sanity check.

Two warnings.

The constants are not portable. Kaplan got $\alpha_N \approx 0.076$; Chinchilla got $\alpha \approx 0.34$. Same form, very different numbers. Anyone planning a run today needs to fit the constants to their own data mixture, tokenizer, and schedule, not borrow them from the paper.

Cross-entropy is the metric. Power laws in cross-entropy do not imply power laws in any downstream metric. Wei et al. (2022) cataloged emergent capabilities that look like sharp transitions on benchmark accuracy as model size grows — the underlying loss decreases smoothly, but the task-success indicator goes from 0 to 1 over a narrow range. The cross-entropy power law is real; "more capability per FLOP" depends on what you measure.

The 2024-onward picture has added test-time compute as a third axis. OpenAI's o1 and DeepSeek-R1 demonstrated that more inference-time reasoning trades against pretraining compute, opening a second scaling law on the inference side. The pretraining-only picture from this 2020 paper is necessary but no longer sufficient.

References

Canonical:

  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.

Critical correction:

  • Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556. The Chinchilla paper. Re-runs the sweep with tuned schedules and corrects the optimal $N : D$ scaling to roughly $1 : 1$.

Direct precursors:

  • Hestness, J. et al. (2017). "Deep Learning Scaling is Predictable, Empirically." arXiv:1712.00409. Earlier scaling-law observation across vision, NMT, language modeling.
  • Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., & Shavit, N. (2020). "A Constructive Prediction of the Generalization Error Across Scales." ICLR. arXiv:1909.12673. Earlier functional-form fits.

Direct application:

  • Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS. arXiv:2005.14165. GPT-3 — the run planned with these constants.

Follow-on work:

  • Henighan, T. et al. (2020). "Scaling Laws for Autoregressive Generative Modeling." arXiv:2010.14701. Same form across modalities.
  • Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682. Smooth cross-entropy can mask sharp benchmark transitions.
  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS. arXiv:2304.15004. Argues many "emergent" capabilities are artifacts of metric choice.
  • Snell, C. et al. (2024). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314. The inference-time scaling axis.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12.7.


Last reviewed: May 5, 2026