
Paper breakdown

Scaling Laws for Neural Language Models

Jared Kaplan et al. · 2020 · arXiv preprint

Empirical power-law fits relating cross-entropy test loss to non-embedding parameter count, dataset size, and compute. The fitted constants and the compute-optimal training prescription guided the planning of GPT-3 and served as the operating manual for foundation models for two years.

Overview

Kaplan and collaborators (2020) trained over 200 transformer language models spanning seven orders of magnitude in compute, four orders of magnitude in non-embedding parameter count, and three orders of magnitude in tokens. They reported clean power-law fits relating cross-entropy test loss to each of those axes, with deviations only at the smallest sizes and at the compute frontier where data ran out.

The paper's prescription is concrete. Given a fixed compute budget $C$, allocate it between model size $N$ and training tokens $D$ so that loss is minimized; the paper claims $N \propto C^{0.73}$ and $D \propto C^{0.27}$. The dominant share goes to model size; only a small share to data. This was the operating manual for OpenAI through GPT-3 (2020) and for most public LLM efforts through early 2022.

In March 2022, Hoffmann et al. at DeepMind ran a more careful experimental sweep at fixed compute and reported a different optimal allocation: $N$ and $D$ should scale roughly equally. Chinchilla (a 70B model trained on 1.4T tokens) outperformed Gopher (a 280B model trained on 300B tokens) at the same FLOP budget. The Kaplan paper was wrong on the constants. The framework itself, power laws in $N$, $D$, and $C$ with breaks when one of them becomes the bottleneck, survived and is still how foundation-model planning works.

Two takeaways. First, the form of the scaling laws is robust: cross-entropy descends as a power of compute over five orders of magnitude. Second, the constants depend on the experimental protocol — what learning rate schedule, what tokenizer, what context length — and are not portable across labs. Read both papers if you are planning a training run.

Mathematical Contributions

The three power laws

For a transformer with non-embedding parameter count $N$ trained on a dataset of $D$ tokens, with each fit taken in the regime where the other resources are not limiting, the paper fits:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},\qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D},\qquad L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}$$

with reported exponents $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$, and constants $N_c \approx 8.8 \times 10^{13}$, $D_c \approx 5.4 \times 10^{13}$, and $C_c \approx 3.1 \times 10^8$ PF-days. $C_{\min}$ is the minimum compute at which a given loss is achievable, which differs from the paper's other compute axis $C$ that includes overheads.

Each fit holds when the corresponding resource is the bottleneck. If you have plenty of data and parameters, compute is the bottleneck and $L \sim C_{\min}^{-\alpha_C}$. The paper shows the three curves are consistent and proposes a joint functional form combining them.
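Plugging the reported constants back in is a useful sanity check on the fits and their units. A minimal sketch in Python, using the values quoted above; the example model size is illustrative:

```python
# Single-variable scaling fits from Kaplan et al. (2020), using the
# constants quoted above. Units: N in non-embedding parameters,
# D in tokens, C_min in PF-days.

ALPHA_N, ALPHA_D, ALPHA_C = 0.076, 0.095, 0.050
N_C, D_C, C_C = 8.8e13, 5.4e13, 3.1e8

def loss_from_params(n):       # L(N): data and compute not limiting
    return (N_C / n) ** ALPHA_N

def loss_from_tokens(d):       # L(D): model size not limiting
    return (D_C / d) ** ALPHA_D

def loss_from_compute(c_min):  # L(C_min): compute-efficient frontier
    return (C_C / c_min) ** ALPHA_C

# Example: a GPT-2-scale model (~1.5e9 non-embedding parameters)
print(loss_from_params(1.5e9))  # ~2.3 cross-entropy per token under the N-only fit
```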

The joint loss surface

Section 4 fits a single joint function:

$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$

which separates into the $N$-bottleneck and $D$-bottleneck regimes. When $N$ is small, the first term dominates; when $D$ is small, the second. The crossover defines the compute-optimal allocation.
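A short sketch of the joint surface, reusing the constants above; the specific $(N, D)$ values are illustrative, chosen only to sit deep in each bottleneck regime:

```python
# Joint loss surface L(N, D) from the equation above, reusing the
# constants of the single-variable fits.

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def loss(n, d):
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# N-bottleneck: data effectively unlimited, first term dominates
print(loss(1e8, 1e15))   # ~= (N_C / 1e8) ** ALPHA_N
# D-bottleneck: parameters effectively unlimited, second term dominates
print(loss(1e15, 1e9))   # ~= (D_C / 1e9) ** ALPHA_D
```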

Compute-optimal allocation (Kaplan)

Compute cost scales as $C \approx 6ND$ for transformer training (roughly 6 FLOPs per parameter per token across the forward and backward passes). Minimizing $L(N, D)$ subject to $6ND = C$ with a Lagrange multiplier gives, after the paper's algebra:

$$N_{\text{opt}}(C) \propto C^{0.73},\qquad D_{\text{opt}}(C) \propto C^{0.27}$$

The exponents come from $\alpha_N$ and $\alpha_D$. The implication: 10× the compute, and parameter count grows ~5.4× while tokens grow only ~1.85×. This is what justified the GPT-3 training recipe of 175B parameters trained on 300B tokens.
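A worked example of the allocation rule. The compute identity $C \approx 6ND$ and the GPT-3 anchor point (175B parameters, 300B tokens) come from this section; the proportionality constants below are back-solved from that anchor for illustration and are not the paper's fitted values:

```python
# Kaplan-style compute-optimal allocation: N ∝ C^0.73, D ∝ C^0.27,
# with training compute C ≈ 6 * N * D (FLOPs).
# Anchor point: the GPT-3 recipe mentioned above (175B params, 300B tokens).
# The proportionality constants are back-solved from it for illustration.

N0, D0 = 1.75e11, 3.0e11   # anchor: parameters, tokens
C0 = 6 * N0 * D0           # anchor compute in FLOPs

def kaplan_allocation(c_flops):
    """Split a FLOP budget into (params, tokens) using the Kaplan exponents."""
    n = N0 * (c_flops / C0) ** 0.73
    d = D0 * (c_flops / C0) ** 0.27
    return n, d

# 10x the compute: parameters grow ~5.4x, tokens only ~1.9x
n, d = kaplan_allocation(10 * C0)
print(f"{n / N0:.2f}x params, {d / D0:.2f}x tokens")  # 5.37x params, 1.86x tokens
```

Because the two exponents sum to 1, the split stays consistent with the constraint: $6\,N_{\text{opt}}D_{\text{opt}}$ scales exactly as $C$.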

Why this was wrong on data

Hoffmann et al. (2022) re-ran the sweep with more careful hyperparameter tuning, in particular tuning the learning-rate schedule to each run's token budget, where Kaplan et al. had used a fixed schedule length anchored to the largest runs. The Chinchilla fits give:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with $\alpha \approx 0.34$ and $\beta \approx 0.28$, leading to:

$$N_{\text{opt}}(C) \propto C^{0.50},\qquad D_{\text{opt}}(C) \propto C^{0.50}$$

The mechanism for the discrepancy was traced to the learning-rate schedule: training small models with the same long schedule used for large models under-trains them, biasing the parameter exponent upward. Once the schedule was tuned per-run, the data exponent grew and the optimal allocation became roughly equal. See scaling compute-optimal training for the full comparison.
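For comparison with the Kaplan allocation above, the same Lagrange-multiplier argument applied to the Chinchilla fit gives $N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$ (a standard derivation, not quoted from either paper), which with the exponents above lands close to the equal split. A sketch:

```python
# Chinchilla-style allocation: minimize E + A/N**alpha + B/D**beta
# subject to 6*N*D = C. The first-order condition gives
#   N_opt ~ C**(beta/(alpha+beta)),  D_opt ~ C**(alpha/(alpha+beta)).

alpha, beta = 0.34, 0.28
a = beta / (alpha + beta)   # exponent for N_opt(C)
b = alpha / (alpha + beta)  # exponent for D_opt(C)
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")  # N_opt ~ C^0.45, D_opt ~ C^0.55

# Roughly equal exponents mean tokens per parameter stay roughly constant
# along the compute-optimal frontier; the Chinchilla run mentioned above
# (70B params, 1.4T tokens) sits at 20 tokens per parameter.
print(1.4e12 / 70e9)  # 20.0
```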

What the paper got right

The functional form (power laws over five orders of magnitude in compute) was confirmed by every subsequent independent replication. The fact that scaling alone — no architectural change, no algorithmic trick — predictably improves loss became the empirical foundation of foundation-model development. The rough order of magnitude for the constants was good enough to plan GPT-3 successfully, even if the optimal allocation was later corrected.

Other reported regularities

Section 3: model performance is insensitive to architecture details (depth, width, attention-head count) at fixed $N$. The dominant axis is parameter count, not architectural geometry. This justifies treating the transformer as a black-box scaling unit. Section 5: the critical batch size grows as the target loss falls, roughly $B_{\text{crit}} \propto L^{-1/\alpha_B}$ for a separately fitted exponent $\alpha_B$; bigger models and lower target losses support larger batches, which became the rationale for multi-million-token batches.
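A sketch of the batch-size heuristic as stated above; the constants $B^\ast$ and $\alpha_B$ in the code are assumed, order-of-magnitude stand-ins rather than values quoted in this breakdown:

```python
# Critical batch size heuristic: B_crit grows as the target loss falls,
# B_crit(L) ~ B_star / L ** (1 / alpha_B).
# B_star and alpha_B are assumed placeholders (order-of-magnitude only),
# not values quoted in this breakdown.

B_STAR = 2e8     # tokens (assumed)
ALPHA_B = 0.21   # assumed exponent

def critical_batch_tokens(target_loss):
    return B_STAR / target_loss ** (1.0 / ALPHA_B)

# Lower target loss -> much larger useful batch (in tokens)
for L in (4.0, 3.0, 2.5):
    print(L, f"{critical_batch_tokens(L):.2e}")
```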


Why It Matters Now

This paper changed how research is planned. Before scaling laws, "make it bigger" was a vague suggestion; after, it was a quantitative recipe with a slope. Funded training runs went from "we hope scaling works" to "loss at this compute will be approximately this value." The error bars around that prediction are now small enough that some labs publish the predicted vs. actual loss curve as a sanity check.

Two warnings.

The constants are not portable. Kaplan got $\alpha_N \approx 0.076$; Chinchilla got $\alpha \approx 0.34$. Same form, very different numbers. Anyone planning a run today needs to fit the constants to their own data mixture, tokenizer, and schedule, not borrow them from the paper.

Cross-entropy is the metric. Power laws in cross-entropy do not imply power laws in any downstream metric. Wei et al. (2022) cataloged emergent capabilities that look like sharp transitions on benchmark accuracy as model size grows — the underlying loss decreases smoothly, but the task-success indicator goes from 0 to 1 over a narrow range. The cross-entropy power law is real; "more capability per FLOP" depends on what you measure.

The 2024-onward picture has added test-time compute as a third axis. OpenAI's o1 and DeepSeek-R1 demonstrated that more inference-time reasoning trades against pretraining compute, opening a second scaling law on the inference side. The pretraining-only picture from this 2020 paper is necessary but no longer sufficient.

References

Canonical:

  • Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.

Critical correction:

  • Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. arXiv:2203.15556. The Chinchilla paper. Re-runs the sweep with tuned schedules and corrects the optimal $N : D$ scaling to roughly $1 : 1$.

Direct precursors:

  • Hestness, J. et al. (2017). "Deep Learning Scaling is Predictable, Empirically." arXiv:1712.00409. Earlier scaling-law observation across vision, NMT, language modeling.
  • Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., & Shavit, N. (2020). "A Constructive Prediction of the Generalization Error Across Scales." ICLR. arXiv:1909.12673. Earlier functional-form fits.

Direct application:

  • Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." NeurIPS. arXiv:2005.14165. GPT-3 — the run planned with these constants.

Follow-on work:

  • Henighan, T. et al. (2020). "Scaling Laws for Autoregressive Generative Modeling." arXiv:2010.14701. Same form across modalities.
  • Wei, J. et al. (2022). "Emergent Abilities of Large Language Models." TMLR. arXiv:2206.07682. Smooth cross-entropy can mask sharp benchmark transitions.
  • Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS. arXiv:2304.15004. Argues many "emergent" capabilities are artifacts of metric choice.
  • Snell, C. et al. (2024). "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314. The inference-time scaling axis.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12.7.


Last reviewed: May 5, 2026