
LLM Construction

Scaling Laws

Power-law scaling of LLM loss in parameters, data, and compute: Kaplan, Chinchilla, the Muennighoff data-constrained law for repetition, the Schaeffer metric-induced-emergence proposition, MoE and muP extensions, and the test-time compute axis.

Advanced · Tier 1 · Current · Core spine · ~85 min

Why This Matters

Scaling Law Board (interactive figure)

Smooth power law, expensive mistakes. On log-log axes (training compute vs validation loss), the observed runs and the compute-optimal fit form a straight line: one exponent controls the trend, so loss falls predictably with compute. A dashed curve shows what happens when a run is compute-heavy but data-poor: more FLOPs do not buy the expected loss. In the interactive version, moving the budget marker compares the fitted law's prediction against an undertrained run that spent the same compute without scaling data (e.g. predicted loss 2.85 vs undertrained loss 3.01, a gap of 0.16, at roughly 1e22 FLOPs; the penalty is exaggerated for illustration).

What to learn

Scaling laws are not magic forecasts. They assume the experiment stays on the same recipe frontier.

Scaling laws are empirical power-law fits that relate a model's loss to the number of parameters N, the training tokens D, and the total compute C. They are regressions, not physical laws: they hold within the compute regime where they were fit, and they can break under architectural changes, data repetition, or distribution shift.

Within their regime they are useful. They have guided training decisions worth billions of dollars: how large to make a model, how much data to collect, and how to allocate a fixed compute budget between model size and training duration. They connect directly to compute-optimal training and to the transformer architecture that all modern LLMs share. The Kaplan et al. paper breakdown walks through the original 2020 power-law fits and the Chinchilla correction.

Mental Model

Imagine you have a fixed compute budget (say, 10^{24} FLOPs). You must decide: train a large model for fewer steps, or a smaller model for more steps? Scaling laws answer this question precisely. They tell you that loss decreases as a power law in each resource, and that there is an optimal way to split your budget between model size and data.

The key surprise: loss follows smooth, predictable power laws over many orders of magnitude. A model with 10x more parameters trained on the same data will have predictably lower loss. This predictability is what makes scaling laws practically useful: one can extrapolate from small experiments to forecast the performance of much larger runs.
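A minimal sketch of that extrapolation workflow. The "measured" losses here are synthesized from a Kaplan-style power law (N_c = 8.8e13, alpha = 0.076 are illustrative stand-ins, not real run data); the point is the mechanics: fit a line in log-log space on small runs, then read off the forecast for a much larger model.

```python
import math

# Hypothetical small-scale runs: losses generated from L = (Nc / N)^alpha,
# standing in for measured validation losses at 1M..1B parameters.
runs = [(10**k, (8.8e13 / 10**k) ** 0.076) for k in range(6, 10)]

# A power law is a straight line in log-log space: log L = alpha*(log Nc - log N).
# Fit slope and intercept by ordinary least squares on (log N, log L).
xs = [math.log10(n) for n, _ in runs]
ys = [math.log10(l) for _, l in runs]
m = len(xs)
xbar, ybar = sum(xs) / m, sum(ys) / m
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
intercept = ybar - slope * xbar

alpha_hat = -slope                    # recovered scaling exponent

def predict_loss(N):
    # Extrapolate the fitted log-log line to a model size never trained.
    return 10 ** (intercept + slope * math.log10(N))

print(alpha_hat)                      # recovers ~0.076
print(predict_loss(1e11))             # forecast for a 100x-larger model
```

Real fits are noisier and require runs spanning several orders of magnitude, but the forecasting logic is exactly this.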

The Kaplan Scaling Laws (2020)

Definition

Kaplan Power-Law Scaling

Kaplan et al. (2020) empirically observed that cross-entropy loss L on language modeling scales as power laws in parameters N, data D, and compute C, when each is varied independently with the others held sufficient:

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

where N_c, D_c, C_c are scaling constants and the exponents are approximately:

  • \alpha_N \approx 0.076 (loss vs parameters)
  • \alpha_D \approx 0.095 (loss vs data)
  • \alpha_C \approx 0.050 (loss vs compute)

These power laws hold over at least 7 orders of magnitude in compute.
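A quick way to read the exponents: each 10x scale-up of a resource multiplies loss by 10^{-\alpha}. The snippet below just evaluates that arithmetic for the three Kaplan exponents.

```python
# Per-decade improvement implied by each Kaplan exponent:
# scaling a resource by 10x multiplies loss by 10**(-alpha).
exponents = {"params (alpha_N)": 0.076, "data (alpha_D)": 0.095, "compute (alpha_C)": 0.050}
for name, alpha in exponents.items():
    factor = 10 ** (-alpha)
    print(f"{name}: 10x -> loss x {factor:.3f} ({(1 - factor) * 100:.1f}% reduction)")
```

The data exponent gives the largest per-decade reduction (~20%), parameters ~16%, compute ~11%, matching the ordering discussed below.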

Proposition

Power-Law Scaling of Language Model Loss

Statement

When parameters N and data D are both potentially limiting, the loss follows an approximate decomposition:

L(N, D) \approx \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \beta} + \left(\frac{D_c}{D}\right)^{\alpha_D / \beta}\right]^{\beta}

for fitted constants. In the regime where one factor dominates, this reduces to the individual power laws above.

Reading the exponents directly: a larger exponent means loss falls faster as that resource grows. Since \alpha_D > \alpha_N, in the isolated power laws loss is more sensitive to data than to model size. Kaplan's recommendation to favor model size at a fixed compute budget came from the joint compute-allocation analysis (using C \approx 6ND and the joint fit for L(N, D)), not from the individual exponents. Chinchilla later revisited that allocation analysis and reached the opposite conclusion.

Intuition

A power law L \propto N^{-\alpha} means that each 10x increase in N gives a fixed percentage reduction in loss. The exponent \alpha determines how fast loss improves. With \alpha_D > \alpha_N, a 10x increase in data reduces loss more than a 10x increase in parameters does, in the regime where the other factor is not the bottleneck. The compute-allocation story is separate: at fixed C \approx 6ND, spending a unit of compute on more parameters vs more tokens depends on the joint fit, and that is the analysis Kaplan used to recommend scaling N faster than D.

Why It Matters

The Kaplan scaling laws directly influenced the training of GPT-3 (175B parameters trained on 300B tokens). The recommendation to favor large models with moderate data was the dominant paradigm from 2020 to 2022. It was overturned by the Chinchilla analysis.

Failure Mode

The Kaplan analysis trained all models for a relatively small number of tokens and extrapolated. Crucially, it did not train smaller models to full convergence—biasing the analysis toward large models. The Chinchilla paper corrected this methodological issue and reached the opposite conclusion about optimal allocation.

Chinchilla Scaling (2022)

Theorem

Chinchilla-Optimal Compute Allocation

Statement

Hoffmann et al. (2022) showed that for a fixed compute budget C, the optimal number of parameters N^* and training tokens D^* both scale roughly as the square root of compute:

N^* \propto C^a, \qquad D^* \propto C^b

with a + b = 1 from the FLOP constraint C \approx 6ND. Hoffmann et al.'s parametric fit gives a \approx 0.452 and b \approx 0.548 (their Approach 3); the two exponents are close but not equal, so N^* and D^* scale together with compute but at slightly different rates rather than being strictly proportional.

Parameters and training tokens should scale roughly together with compute. At Chinchilla's compute budget, the empirical compute-optimal allocation is about D^* / N^* \approx 20 tokens per parameter. Because b > a, this ratio drifts upward slowly with compute (as C^{b-a} \approx C^{0.1}) rather than staying fixed. The "doubling parameters means doubling tokens" slogan is the a = b = 1/2 idealization, accurate to leading order but not exact.

The "20 tokens per parameter" figure is an empirical artifact of those specific exponents on Hoffmann et al.'s corpus, not a universal constant. Refits on different model families and corpora yield different ratios, and most modern open models intentionally over-train relative to this ratio because inference cost is dominated by N alone.
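To make the drift concrete, here is a sketch of the Approach-3 allocation, anchored (as an assumption) at the Chinchilla run itself (N = 70B, D = 1.4T, C = 6ND):

```python
# N* ~ C^a, D* ~ C^b with a + b = 1 (Hoffmann et al., Approach 3).
a, b = 0.452, 0.548
N0, D0 = 70e9, 1.4e12          # anchor: the Chinchilla run
C0 = 6 * N0 * D0

def optimal_alloc(C):
    """Compute-optimal (N*, D*) relative to the Chinchilla anchor point."""
    return N0 * (C / C0) ** a, D0 * (C / C0) ** b

for mult in [1, 10, 100]:
    N, D = optimal_alloc(C0 * mult)
    print(f"C = {mult:>3}x Chinchilla: N* = {N:.2e}, D* = {D:.2e}, D*/N* = {D / N:.0f}")
```

At the anchor the ratio is 20 tokens per parameter; at 100x the compute it has drifted to roughly 31, i.e. the ratio grows slowly as C^{0.1} rather than staying fixed.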

Intuition

Kaplan said: make the model as large as possible. Chinchilla said: balance model size and data. The difference comes from how you define "optimal." If you fix compute and ask "what achieves the lowest loss?", Chinchilla shows that an overparameterized, undertrained model wastes compute. Training a smaller model on more data reaches the same loss with the same compute budget.

Proof Sketch

The analysis fits a parametric loss function:

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

where E is the irreducible entropy of natural language. Subject to the constraint C = 6ND (total compute), minimize L over (N, D).

Using Lagrange multipliers: at the optimum, \frac{\partial L}{\partial N} / \frac{\partial L}{\partial D} = D/N, which gives \alpha A N^{-\alpha} = \beta B D^{-\beta}. Hoffmann et al. fit \alpha \approx 0.34 and \beta \approx 0.28 (close, but not equal). Because the two exponents are nearly equal, the optimum scales as N^* \propto C^{a} and D^* \propto C^{b} with a, b both close to 0.5. The exact exponents depend on (\alpha, \beta); the widely quoted N^* \propto C^{1/2} result is what you get in the \alpha = \beta limit.
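The allocation exponents follow directly from the fitted loss-curve exponents (the closed form is derived in the proposition later in this page):

```python
# N* ~ C^(beta/(alpha+beta)), D* ~ C^(alpha/(alpha+beta)),
# evaluated at Hoffmann et al.'s parametric fit.
alpha, beta = 0.34, 0.28
a = beta / (alpha + beta)    # exponent on N*
b = alpha / (alpha + beta)   # exponent on D*
print(a, b, a + b)           # ~0.452, ~0.548, exactly 1
```

This reproduces the a ≈ 0.452, b ≈ 0.548 values quoted in the theorem statement above.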

Why It Matters

Chinchilla (70B parameters, 1.4T tokens) matched the performance of Gopher (280B parameters, 300B tokens) with 4x fewer parameters and the same compute budget. This result reshaped the industry: Llama, DeepSeek, Mistral, and most subsequent models are "Chinchilla-optimal" or even "over-trained" (using more data than Chinchilla-optimal for a given size, because inference cost depends on NN while training cost depends on CC).

Failure Mode

Chinchilla-optimal minimizes loss per FLOP of training. But inference cost depends on N, not D. If you plan to serve a model to millions of users, it is cheaper to over-train a small model (use more data than Chinchilla recommends) than to serve a Chinchilla-optimal larger model. Llama 3 8B was trained on 15T tokens—roughly 1875 tokens per parameter, nearly 100x the Chinchilla ratio—because the inference savings from a smaller model outweigh the training compute cost.

The Compute Constraint: C \approx 6ND

Definition

Training Compute Estimate

For a decoder-only transformer with N parameters, one forward pass on one token requires approximately 2N FLOPs (one multiply-add per parameter). The backward pass requires approximately 4N FLOPs (roughly 2x the forward pass). Training on D tokens requires:

C \approx 6ND \text{ FLOPs}

This is the "6ND rule." It ignores attention cost (O(n^2 d) per layer) but is accurate within a factor of 2 for typical model sizes and context lengths.

This formula is the bridge between the abstract scaling laws and concrete training decisions. Given a GPU cluster with a known FLOP budget, you can directly compute the Chinchilla-optimal N and D.
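A worked instance of that bridge: combine C = 6ND with a target tokens-per-parameter ratio r = D/N (r = 20 is the Chinchilla-era rule of thumb) and solve the two equations for N and D.

```python
import math

def chinchilla_size(C, tokens_per_param=20):
    """Solve C = 6*N*D with D = r*N:  6*r*N^2 = C  =>  N = sqrt(C/(6r))."""
    N = math.sqrt(C / (6 * tokens_per_param))
    return N, tokens_per_param * N

# Example budget: 1e24 FLOPs.
N, D = chinchilla_size(1e24)
print(f"N ~ {N:.2e} params, D ~ {D:.2e} tokens")   # ~9e10 params, ~1.8e12 tokens
```

So a 10^{24}-FLOP budget at 20 tokens per parameter points to a model of roughly 90B parameters trained on roughly 1.8T tokens.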

Emergent Abilities at Scale

Definition

Emergent Abilities

An ability is described as emergent when it is absent in smaller models but appears in larger models. Wei et al. (2022) documented several tasks where model performance was near random below a critical scale and then sharply improved:

  • Few-shot arithmetic: near zero below ~10B parameters, then rapid improvement
  • Multi-step reasoning: absent in small models, present in large ones
  • Code generation: qualitative jump at sufficient scale

The claim: some capabilities emerge discontinuously as a function of scale, rather than improving smoothly.

Schaeffer, Miranda, and Koyejo (2023) argued that the apparent discontinuity is a property of the evaluation metric, not the model. Their argument is sharp and worth stating as a proposition.

Proposition

Metric-Induced Emergence (Schaeffer-Miranda-Koyejo)

Statement

Let p(N) be the per-token probability of producing the correct next token, viewed as a smooth function of model scale N (for example, a power law p(N) = 1 - c N^{-\alpha}). The probability of an exact-match correct answer on a length-L sequence is approximately

\text{Acc}(N) = p(N)^L = \left(1 - c N^{-\alpha}\right)^L.

For any fixed L this curve has an inflection point that becomes arbitrarily sharp as L grows. With L = 5 to 20 (typical for arithmetic and multi-hop reasoning), the curve transitions from \approx 0 to \approx 1 within roughly one order of magnitude in N, even though p(N) itself varies smoothly across many orders of magnitude.

If the same data is rescored with a continuous metric (token edit distance, Brier score, log-probability of the gold answer), the apparent discontinuity disappears and the curve recovers the smooth scaling of p(N).

Empirically, Schaeffer et al. (2023) showed that on the Big-Bench tasks exhibiting "emergence" under exact match, switching to token edit distance or Brier score eliminated the sharp transition. They reproduced the mechanism on a controlled InstructGPT integer arithmetic experiment and even induced apparent emergence on a vision benchmark by changing only the metric.
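The mechanism fits in a few lines. The per-token probabilities below are illustrative (not fitted to any model family); the point is that the same smooth per-token improvements look gradual under a continuous metric and "emergent" under length-L exact match.

```python
import math

L = 30  # answer length in tokens (illustrative)
for p in [0.80, 0.85, 0.90, 0.95, 0.99]:
    exact_match = p ** L           # sequence-level exact match: near 0, then a jump
    token_logprob = math.log(p)    # smooth metric: per-token gold log-prob
    print(f"p = {p:.2f}: exact-match = {exact_match:.4f}, log-prob = {token_logprob:.4f}")
```

The per-token log-prob improves gently (from about -0.22 to -0.01), while exact match climbs from roughly 0.001 to 0.74, a ~600x swing produced entirely by the metric's p^L nonlinearity.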

Intuition

A long-string exact-match metric is a many-to-one nonlinearity: \text{Acc}(N) = \mathbb{1}[p(N)^L > \tau] for some threshold \tau, which in expectation behaves like a step function of p. Smooth improvements in p get compressed into a narrow band of N values. The model is not undergoing a phase transition; the metric is.

Proof Sketch

For independent per-token correctness with probability p, \text{Acc} = p^L. Differentiating, d\text{Acc}/dp = L p^{L-1}, which is maximized at p = 1, where it equals L. So the slope of the accuracy curve in N is L p^{L-1} \cdot p'(N), scaling with L. For L \gg 1 the slope concentrates at p near 1, producing the visual impression of a sudden jump. The exact-match experiment in Schaeffer et al. (2023) is the InstructGPT family on k-digit integer addition: per-token log-prob of the gold digits scales smoothly, sequence-level accuracy shows the canonical "emergent" curve, and Brier-score rescoring recovers the smooth trend.

Why It Matters

The proposition does not say emergent abilities never happen. It says the default evaluation pipeline manufactures the appearance of emergence even when the underlying capability is improving smoothly. Two consequences for practice. (i) Forecasting: pretraining loss extrapolates well, but exact-match downstream benchmarks do not, even when the underlying capability is following the same smooth trend. (ii) Safety: a narrative of "sharp, unpredictable capability jumps" should not be supported by evidence that is generated entirely by a discontinuous metric.

Failure Mode

The (1 - c N^{-\alpha})^L model is an idealization. (a) Per-token errors are not independent: a wrong token early in a chain biases later tokens, so the true accuracy is below p^L. (b) Some tasks have a genuine threshold structure not captured by independent-token accuracy (e.g., a model that either knows an algorithm or does not). For these, smooth metrics may also show a visible kink. (c) The proposition rules out one explanation for emergence; it does not prove the negation. Whether a given claimed emergent ability survives a continuous-metric audit is a per-task empirical question.

The empirical picture after Schaeffer et al.: on BIG-Bench, the majority of tasks formerly cited as exhibiting emergence become smooth under continuous metrics. A residual fraction (some compositional reasoning tasks in particular) retains a kink even under smooth metrics. The "are emergent abilities real?" question splits cleanly: metric-induced emergence is widespread and well-explained; capability-driven threshold behavior exists but is narrower than originally claimed.

The Decomposed Scaling Law

Definition

Decomposed Scaling Law (functional form)

Assume a transformer language model trained near convergence, with compute allocated compute-optimally per Hoffmann et al. (2022). The cross-entropy loss is modeled as:

L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

where:

  • N is the parameter count, D is the training token count.
  • E \geq 0 is a loss floor for the given data distribution and tokenizer: the loss of an infinite model trained on infinite data.
  • A, B > 0 and \alpha, \beta > 0 are fitted constants.
  • A / N^\alpha captures limited model capacity.
  • B / D^\beta captures limited dataset information.

As N \to \infty or D \to \infty, the corresponding term vanishes, but the loss cannot go below E.

Example

Empirical fits (Chinchilla, Kaplan)

Hoffmann et al. (2022) fit \alpha \approx 0.34, \beta \approx 0.28, E \approx 1.69 nats on their corpus. Kaplan et al. (2020) report a different irreducible floor and different exponents on their WebText2-style corpus. These values are observations, not universal constants: they vary with data distribution, language, tokenization, architecture, and fitting procedure. Treat them as calibrated priors for a given setup, not as fundamental parameters of nature.

Even so, the decomposition is practically predictive. Once A, B, \alpha, \beta, E are fit on small-scale runs, the extrapolated loss of much larger runs (Chinchilla, Llama, GPT-4 class) has matched measured loss within a few percent. This is what justifies the capital expenditure on large training runs.

The predictive power is restricted to pretraining loss. Downstream task accuracy can be nonlinear in loss: small loss improvements may produce large capability jumps or none at all.
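As a sanity check of the decomposed form, the snippet below evaluates it with the constants reported in Hoffmann et al.'s parametric fit (A ≈ 406.4, B ≈ 410.7 alongside the exponents above; treat all five as corpus-specific, not universal):

```python
# Decomposed scaling law with Hoffmann et al.'s fitted constants.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Predicted pretraining loss (nats/token) for N params, D tokens."""
    return E + A / N**alpha + B / D**beta

# Evaluate at the Chinchilla run itself: 70B params, 1.4T tokens.
print(loss(70e9, 1.4e12))
```

The prediction lands a bit above E (roughly 1.9 nats), with the capacity and data terms each contributing a small residual on top of the irreducible floor.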

Proposition

Compute-Optimal Allocation under the Decomposed Form

Statement

Minimize L(N, D) = E + A N^{-\alpha} + B D^{-\beta} subject to the compute constraint C = 6ND. The optimal (N^*, D^*) satisfies:

N^* \propto C^{\beta / (\alpha + \beta)}, \qquad D^* \propto C^{\alpha / (\alpha + \beta)}.

The two exponents sum to 1, consistent with C = 6ND.

Intuition

Each term A N^{-\alpha} and B D^{-\beta} is convex in its variable. The constraint ND = C/6 is a hyperbola in (N, D). The minimum is where the marginal loss reduction per unit compute is equal across N and D. Allocating more compute to the resource with the faster-decaying term gives diminishing returns sooner, so the optimum balances the two exponents.

Proof Sketch

Form the Lagrangian \mathcal{L} = A N^{-\alpha} + B D^{-\beta} + \lambda(ND - C/6). First-order conditions:

-\alpha A N^{-\alpha - 1} + \lambda D = 0, \qquad -\beta B D^{-\beta - 1} + \lambda N = 0.

Eliminating \lambda gives \alpha A D^{\beta} = \beta B N^{\alpha}, so D^\beta \propto N^\alpha. Combined with ND = C/6, solve for N^* in C:

N^{\alpha + \beta} \propto C^\beta \implies N^* \propto C^{\beta / (\alpha + \beta)},

and symmetrically D^* \propto C^{\alpha / (\alpha + \beta)}.

Why It Matters

This is the provable skeleton behind Chinchilla-style allocation rules. Whatever the fitted (\alpha, \beta), the Lagrangian argument fixes the functional form of optimal allocation. Only the specific exponent ratio depends on the empirical fit.
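The closed form can be checked numerically without any calculus: for each budget C, scan N over a log grid (with D = C/(6N) forced by the constraint) and find the loss-minimizing N. The constants below are arbitrary stand-ins of the right magnitude.

```python
import math

E, A, B, alpha, beta = 1.69, 400.0, 410.0, 0.34, 0.28

def best_N(C, points=20001):
    """Brute-force the loss-minimizing N on a log grid, with D = C/(6N)."""
    lo, hi = 6.0, 14.0  # log10(N) search range
    best = None
    for i in range(points):
        N = 10 ** (lo + (hi - lo) * i / (points - 1))
        D = C / (6 * N)
        L = E + A / N**alpha + B / D**beta
        if best is None or L < best[0]:
            best = (L, N)
    return best[1]

# Recover the scaling exponent of N* from two budgets two decades apart.
C1, C2 = 1e21, 1e23
measured = math.log10(best_N(C2) / best_N(C1)) / math.log10(C2 / C1)
print(measured, beta / (alpha + beta))   # both ~0.452
```

The grid-search exponent matches \beta / (\alpha + \beta) to within the grid resolution, confirming the Lagrangian result.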

Failure Mode

The result assumes the functional form L = E + A N^{-\alpha} + B D^{-\beta} holds globally. If the true loss deviates (data repetition, data-constrained regimes, curriculum effects, architectural phase transitions), the allocation prescription breaks. The 6ND FLOP estimate also ignores attention cost, which grows with context length and can dominate in long-context training.

Data-Constrained Scaling

Chinchilla and the decomposed law assume each training token is unique. Frontier runs have moved past the easy-to-curate web: Llama 3 trained on 15T tokens, Qwen and DeepSeek on similar scales, and the supply of high-quality unique text in major languages is finite. The relevant question becomes: when you must repeat data, how much does it cost?

Muennighoff et al. (2023) ran the experiment (\sim 400 model variants, up to 9B parameters, 900B training tokens) and fit a clean modification of the Chinchilla form.

Theorem

Data-Constrained Scaling Law (Muennighoff)

Statement

Let U_D denote the unique tokens in the corpus and U_N the compute-optimal Chinchilla parameters for those tokens. With R_D = D / U_D repetitions and R_N = N / U_N excess parameters, the loss is

L(N, D) = E + \frac{A}{(N')^{\alpha}} + \frac{B}{(D')^{\beta}}

with effective counts that exponentially saturate:

D' = U_D + U_D \cdot R_D^{*} \cdot \left(1 - e^{-R_D / R_D^{*}}\right), \qquad N' = U_N + U_N \cdot R_N^{*} \cdot \left(1 - e^{-R_N / R_N^{*}}\right)

The fitted half-life R_D^{*} \approx 15 (so each repeated token loses \sim 1/e of its value after \sim 16 epochs). The corresponding R_N^{*} is smaller, so excess parameters decay faster than repeated data. Empirically:

  • Up to \sim 4 epochs of repetition produce loss within noise of training on fresh data.
  • Beyond R_D^{*}, the marginal value of additional repetition decays to zero; spending the same compute on more parameters or stopping training becomes preferable.
  • Code data tolerates more repetition than natural-language text. Filtering augmentations (e.g., perplexity-filtered repeats) extend the effective half-life modestly but do not change the qualitative shape.
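The saturation is easy to tabulate. One modeling assumption in the sketch below: R_D counts repeat passes beyond the first epoch, so a single epoch of fresh data gives D' = U_D exactly.

```python
import math

R_STAR = 15.0  # fitted repetition half-life from Muennighoff et al.

def effective_tokens(unique_tokens, epochs):
    """Effective data D' under repetition (repeats = epochs - 1, by assumption)."""
    repeats = epochs - 1
    return unique_tokens * (1 + R_STAR * (1 - math.exp(-repeats / R_STAR)))

U = 1e12  # 1T unique tokens
for epochs in [1, 4, 16, 60]:
    eff = effective_tokens(U, epochs)
    print(f"{epochs:>2} epochs: {epochs * U / 1e12:.0f}T raw -> {eff / 1e12:.2f}T effective "
          f"({eff / (epochs * U):.0%} of fresh-data value)")
```

Four epochs retain over 90% of fresh-data value; by 16 epochs the marginal value has decayed substantially; and D' can never exceed (1 + R_D^{*}) times the unique data, no matter how many epochs you run.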

Intuition

Repeating a token that the model has already memorized provides almost no new information. The first few epochs still extract residual structure (rare n-grams, long-range dependencies). After enough repetitions, the loss term B / (D')^{\beta} saturates because D' stops growing, and additional compute is wasted unless reallocated to capacity. The exponential form is empirical, but it has a clean reading: each new epoch contributes geometrically less effective data than the previous one.

Proof Sketch

The functional form is fitted, not derived. A heuristic for why it is roughly the right form: assume each repeated pass extracts a fraction \rho < 1 of the residual information. The total information extracted after R passes is the geometric sum

1 + \rho + \rho^2 + \cdots + \rho^{R-1} = \frac{1 - \rho^R}{1 - \rho}.

This has the same saturating shape as 1 - e^{-R / R^*} with R^* = -1 / \log \rho. Muennighoff et al. fit the exponential rather than the geometric form because it generalized better across model sizes and corpora.

Why It Matters

The data-constrained law turns "are we running out of data?" into a precise quantity. Given a unique-token budget U_D and a compute budget C, you can solve for the allocation that minimizes loss subject to the constraint that D = U_D \cdot R. The result: in data-constrained regimes the optimal N shifts upward relative to fresh-data Chinchilla (more parameters, fewer effective tokens). It also bounds how much synthetic-data and augmentation-only strategies can buy you: if augmented samples behave like repeats, the half-life is the binding constraint. Recent practice (DeepSeek, Llama 3, Qwen) is consistent with the \sim 4 epoch finding.

Failure Mode

Three caveats. (a) The fit was done on web-scale natural-language corpora with a particular filtering pipeline; the effective R_D^{*} depends on data diversity and filtering. High-quality curated repeats (textbook-style) behave better than uniform random repeats. (b) The model assumes the same loss objective and tokenizer throughout. Curriculum changes (different data mixes per epoch) violate the additive form. (c) The law speaks to validation loss. Downstream task performance has its own dependence on data freshness and may degrade more or less sharply than loss does.

Watch Out

Repetition is not literally free up to 4 epochs

The "\sim 4 epochs is fine" result is about loss on a held-out distribution, fit at the granularity of \sim 1% noise. Specific downstream benchmarks can degrade earlier under repetition (especially memorization-sensitive evaluations like exact-string recall) or later (compositional benchmarks where the model needs more passes to learn rare structures). Use R \leq 4 as a useful default, not as a guarantee.

Scaling for Downstream Tasks vs Pretraining Loss

Pretraining loss scales smoothly and predictably. Downstream task performance does not always follow the same smooth curve:

  • Smooth scaling: Tasks well-correlated with language modeling (text generation quality, perplexity) scale smoothly with loss.
  • Threshold scaling: Tasks requiring specific capabilities (multi-step reasoning, tool use) may show sharp transitions at particular loss levels.
  • Saturation: Some tasks saturate quickly (sentiment analysis) while others continue improving with scale (complex reasoning).

This means you cannot simply extrapolate a scaling curve for an individual task. You can predict the loss of a 10x larger model with high confidence, but predicting whether it will pass a specific evaluation benchmark requires understanding the relationship between loss and that specific capability.

Scaling Beyond Dense Transformers

Kaplan and Chinchilla fit dense decoder-only transformers. Two extensions matter in current practice: mixture-of-experts (MoE) scaling, and width-depth parametrization.

MoE scaling laws. Mixture-of-experts models activate only a subset of parameters per token. The active parameter count N_{\text{act}} governs inference cost and largely governs loss, but total parameters N_{\text{tot}} also contribute via increased capacity. Krajewski et al. (2024) fit a joint scaling law L(N_{\text{act}}, N_{\text{tot}}, D) = E + A / N_{\text{act}}^{\alpha_a} + B / N_{\text{tot}}^{\alpha_t} + C / D^{\beta} and find that the compute-optimal allocation shifts: under a serving-cost constraint that penalizes N_{\text{act}}, a given loss is reached more cheaply by a sparse model with larger N_{\text{tot}} and smaller N_{\text{act}}. DeepSeek MoE (Dai et al. 2024) and DeepSeek-V3 pushed this regime with fine-grained expert segmentation (many small experts, high routing capacity) and reported competitive loss at roughly an order of magnitude fewer active FLOPs than dense baselines at the same quality.

The practical takeaway: Chinchilla's D^* / N^* \approx 20 was derived for dense models. MoE runs often operate at much higher token-to-active-parameter ratios because total parameters are cheap to scale without increasing the training-time FLOP cost per token in the same way.

muP (maximal update parametrization). Yang and Hu (2022) showed that with a specific per-layer rescaling of initialization variance, learning rates, and output multipliers, the optimal hyperparameters (learning rate, weight decay, initialization scale) become approximately invariant to width. This means a small proxy model can be tuned cheaply, and the same hyperparameters transfer to a much larger target model without re-tuning. Tensor Programs V (Yang et al. 2022) extends this to depth transfer and more general architectures. GPT-4 scale training runs reportedly use muP-style transfer to avoid re-sweeping hyperparameters at every width. This is a scaling-law story in a different sense: the tuning cost no longer scales with N, which changes the economics of large runs. See also convex tinkering for the broader principle of bounded-downside scale-up experiments.

Test-Time Compute Scaling

Kaplan and Chinchilla describe how loss scales with compute spent at training time. A separate axis has become central since 2024: how performance scales with compute spent at inference time on a fixed trained model. OpenAI o1 and DeepSeek R1 are the canonical examples: models trained with reinforcement learning to produce long chains of thought that consume far more tokens per query than standard decoding.

The headline empirical result: on reasoning-heavy tasks, adding inference compute to a smaller model can match or beat the accuracy of a much larger model decoded once. The training-compute-versus-inference-compute trade is not fixed by the Chinchilla analysis, which only accounts for pretraining loss.

Definition

Best-of-N Sampling

Given a prompt x, draw N independent completions y_1, \dots, y_N from the model at temperature T > 0, then select one using a scoring rule s(x, y):

\hat{y} = \arg\max_{i \in \{1, \dots, N\}} s(x, y_i).

The score s can be a learned verifier (process reward model, outcome reward model), a ground-truth checker for tasks with verifiable answers (math, code unit tests), or majority vote over final answers (self-consistency). Inference FLOPs scale as roughly N times the cost of a single completion.
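A minimal sketch of the selection loop. `generate` and `score` are hypothetical stand-ins for a real sampler and verifier; here the "model" emits noisy guesses at 12 * 7 and the "verifier" scores by distance to the correct answer.

```python
import random

random.seed(0)

def generate(prompt):
    # Stand-in for one sampled completion (correct answer 84, with noise).
    return 84 + random.choice([-2, -1, 0, 0, 1, 3])

def score(prompt, completion):
    # Stand-in verifier (checker / reward model): higher is better.
    return -abs(completion - 84)

def best_of_n(prompt, n):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

print(best_of_n("What is 12 * 7?", n=20))   # selects the highest-scoring sample
```

In a real deployment, `generate` is N decoder rollouts at temperature T and `score` is the reward model or unit-test harness; everything else is unchanged.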

Coverage scales predictably with samples. Brown et al. (2024), "Large Language Monkeys," measured pass@N (the fraction of problems solved by at least one of N samples) across math and code benchmarks. Coverage follows an approximate power law in N over several orders of magnitude: \log pass@N is roughly linear in \log N. This is a different object from pass@1: it isolates what the model can produce from what a selector does select. A large gap between pass@N and best-of-N accuracy indicates a verifier bottleneck, not a generation bottleneck.
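To estimate coverage from a finite batch of samples without bias, the standard tool is the pass@k estimator from the Codex paper (Chen et al. 2021): with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (Chen et al. 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a 5% per-sample solve rate (c = 5 of n = 100), coverage grows with k:
for k in [1, 10, 100]:
    print(f"pass@{k} = {pass_at_k(100, 5, k):.3f}")
```

pass@1 recovers the raw solve rate (0.05), while pass@10 is already around 0.4: exactly the coverage growth that the Large Language Monkeys power law tracks across orders of magnitude of N.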

Test-time compute can substitute for parameters. Snell et al. (2024), "Scaling LLM Test-Time Compute Optimally," studied how a fixed compute budget should be spent across pretraining and inference: one large model decoded once, a small model sampled many times with a verifier, or tree search guided by a process reward model. On MATH-class problems, a smaller model with an optimal test-time strategy matched the accuracy of a roughly 14x larger pretrained model decoded greedily once, when total compute (pretraining FLOPs plus inference FLOPs aggregated across the expected number of queries) is held equal. The optimal strategy depends on problem difficulty: easy problems favor more parallel samples, hard problems favor sequential revision and search. The small-model advantage shrinks or reverses on the hardest problems, where a single large-model rollout reasons over modes that no amount of small-model sampling can cover.

Reasoning models as RL over chains of thought. OpenAI o1 and DeepSeek R1 are trained with RL against verifiable rewards to produce long internal reasoning traces before answering. The scaling axis here is not samples but tokens per answer. Reported curves show accuracy rising smoothly with thinking-token budget, analogous to a scaling law but with inference tokens rather than training tokens on the x-axis. The mechanism is not fully characterized: candidates include genuine search inside the trace, error correction, and in-context amortization of otherwise parametric knowledge.

Example

Inference budget trade-off

A team has a fixed per-query inference budget of B FLOPs and wants to maximize accuracy on a reasoning benchmark. Two options:

  1. Deploy a 70B-parameter model and decode once. Cost per query is roughly 2 \cdot 70 \cdot 10^9 \cdot L FLOPs for an output of length L, about 1.4 \times 10^{11} L.
  2. Deploy an 8B-parameter model and sample N = 20 candidates, then select with a verifier of comparable cost. Cost per query is roughly 20 \cdot 2 \cdot 8 \cdot 10^9 \cdot L FLOPs, about 3 \times 10^{11} L (the same order of magnitude).

Under Snell et al.'s empirical curves on MATH, the small-model-with-verifier option wins at equal inference FLOPs when the verifier is well-calibrated and problems are not at the extreme tail of difficulty. At the hardest problems, the large model's single-shot reasoning can dominate because no amount of sampling from the weak model covers the correct solution.

The lesson is not "sampling always wins." It is that training compute and inference compute are two independent knobs, and Chinchilla only pins down the first. The optimal serving configuration depends on query distribution, verifier quality, and latency constraints.
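The arithmetic behind the comparison above, under the forward-pass-only 2N-FLOPs-per-token rule (the output length is an assumption, and the verifier's cost is ignored for simplicity):

```python
def decode_flops(params, output_tokens, samples=1):
    """Inference FLOPs: ~2N per generated token, times the number of samples."""
    return 2 * params * output_tokens * samples

L_out = 1000  # assumed output length in tokens
big_once = decode_flops(70e9, L_out)               # 70B model, one completion
small_bo20 = decode_flops(8e9, L_out, samples=20)  # 8B model, 20 samples

print(f"70B x 1:  {big_once:.2e} FLOPs")
print(f"8B  x 20: {small_bo20:.2e} FLOPs ({small_bo20 / big_once:.2f}x)")
```

The two deployments land within a small constant factor of each other (about 2.3x here), which is what makes the accuracy comparison at matched inference compute meaningful.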

Watch Out

Test-time scaling does not repeal training scaling

Reasoning models still benefit from larger base models. The RL-over-reasoning regime composes with pretraining scale rather than replacing it. A small model with a huge inference budget has a ceiling set by what its policy can represent. Test-time compute shifts the Pareto frontier of (training FLOPs, inference FLOPs, accuracy); it does not collapse it to a single axis.

Watch Out

Coverage is not accuracy

pass@N measures whether any sample is correct. Best-of-N accuracy measures whether the selected sample is correct. The two can diverge sharply when the verifier is weak, the reward model is miscalibrated, or the task has many plausible-looking wrong answers. Reporting pass@N without a selector overstates deployable performance.

Common Confusions

Watch Out

Chinchilla-optimal is not always practically optimal

Chinchilla minimizes loss per training FLOP. But if you will serve the model to billions of requests, inference cost (which scales with N) dominates total cost. The practical optimum trains a smaller model on more data than Chinchilla suggests. Llama 3 70B was trained on 15T tokens (roughly 214 tokens per parameter), far beyond the Chinchilla ratio of 20. This is intentional: over-training reduces N for a given quality level, cutting inference costs.
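The training-plus-serving trade-off can be made concrete with the C ≈ 6ND training rule and ~2N FLOPs per served token. The query volume and per-query token count below are illustrative assumptions, not anyone's actual serving figures:

```python
def lifetime_flops(n_params, train_tokens, queries, tokens_per_query):
    """Rough lifetime compute: 6ND for training, 2N per served token."""
    train = 6.0 * n_params * train_tokens
    serve = 2.0 * n_params * tokens_per_query * queries
    return train, serve, train + serve

queries = 1e11   # assumed lifetime query volume
tokens_q = 500   # assumed tokens processed per query

# Chinchilla-ish: larger model at 20 tokens per parameter.
big = lifetime_flops(70e9, 20 * 70e9, queries, tokens_q)
# Over-trained: smaller model on far more tokens per parameter.
small = lifetime_flops(8e9, 15e12, queries, tokens_q)

for name, (tr, sv, tot) in [("70B @ 20 tok/param", big), ("8B @ 15T tokens", small)]:
    print(f"{name}: train {tr:.2e}, serve {sv:.2e}, total {tot:.2e} FLOPs")
```

Under these assumptions the over-trained 8B model spends more on training but far less overall, because serving dominates the 70B model's lifetime cost.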

Watch Out

Power laws do not mean linear improvement

A power law L ∝ N^(−0.076) means you need roughly 10x more parameters for each ~15% relative reduction in loss. Improving from 3.0 to 2.5 nats (a 17% reduction) takes roughly 10x the parameters; improving from 2.5 to 2.0 nats (a 20% reduction) takes roughly 20x more on top of that. The returns are diminishing in absolute terms, though constant in relative (percentage) terms. This is why training frontier models costs hundreds of millions of dollars for incremental improvements.
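Inverting the power law gives the required scale factor directly: to move loss from L1 to L2 under L ∝ N^(−α), you need (L1/L2)^(1/α) times the parameters. A quick check with the Kaplan-style exponent:

```python
ALPHA = 0.076  # Kaplan-style parameter exponent

def scale_factor(loss_from: float, loss_to: float, alpha: float = ALPHA) -> float:
    """Multiplier on N needed to move loss from loss_from to loss_to
    under L ∝ N^(-alpha), ignoring any irreducible loss floor."""
    return (loss_from / loss_to) ** (1.0 / alpha)

print(f"10x params cuts loss by {(1 - 10 ** -ALPHA) * 100:.0f}%")
print(f"3.0 -> 2.5 nats needs ~{scale_factor(3.0, 2.5):.0f}x params")
print(f"2.5 -> 2.0 nats needs ~{scale_factor(2.5, 2.0):.0f}x params")
```

Note the asymmetry: equal absolute steps in loss are growing relative steps, so each one demands a larger multiple of scale.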

Watch Out

Scaling exponents are not universal constants

The exponents α and β depend on the architecture, data distribution, and tokenizer. Different studies report different values. The qualitative result, power-law scaling with equal allocation of compute, is robust, but the exact exponents should not be treated as fundamental constants of nature.

Summary

  • Loss scales as power laws in N, D, and C: L ∝ N^(−α), L ∝ D^(−β).
  • Kaplan (2020): favor large models with moderate data; the under-converged smaller models in the fit biased the conclusion toward N.
  • Chinchilla (2022): scale N and D equally with compute; D*/N* ≈ 20 tokens per parameter for the fitted exponents.
  • Compute estimate: C ≈ 6ND FLOPs for training; the Lagrangian gives N* ∝ C^(β/(α+β)) for any fitted (α, β).
  • Decomposed form: L = E + A/N^α + B/D^β with irreducible entropy E.
  • Data-constrained scaling (Muennighoff 2023): up to ~4 epochs of repetition behave like fresh data; effective tokens saturate exponentially with half-life R_D* ≈ 15.
  • Metric-induced emergence (Schaeffer 2023): sharp downstream "emergence" curves are mostly artifacts of exact-match scoring of length-L outputs; smooth metrics recover smooth scaling on the same model families.
  • Inference cost favors over-training smaller models beyond Chinchilla-optimal; knowledge distillation offers another path to efficient smaller models.
  • Test-time compute is a second scaling axis: best-of-N and RL-trained reasoning traces can substitute inference FLOPs for parameters on reasoning tasks (Snell et al. 2024, Brown et al. 2024).
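The compute-optimal allocation rule can be verified numerically without Lagrange multipliers: fix C, sweep N on a log grid with D = C/(6N), and recover the exponent from how the minimizer moves with C. A sketch using the Chinchilla-style fit constants quoted in the exercises below:

```python
import math

E, A, B = 1.69, 406.4, 410.7   # decomposed-loss fit constants
ALPHA, BETA = 0.34, 0.28

def loss(n: float, d: float) -> float:
    return E + A / n**ALPHA + B / d**BETA

def optimal_n(c: float) -> float:
    """Grid-search the N that minimizes loss subject to C = 6ND."""
    best_n, best_l = None, float("inf")
    for i in range(4000):
        n = 10 ** (6 + 8 * i / 3999)   # sweep N from 1e6 to 1e14
        l = loss(n, c / (6 * n))
        if l < best_l:
            best_n, best_l = n, l
    return best_n

# Fit the exponent a in N* ∝ C^a from two budgets.
n1, n2 = optimal_n(1e21), optimal_n(1e25)
a_hat = math.log(n2 / n1) / math.log(1e25 / 1e21)
print(f"fitted exponent a ≈ {a_hat:.3f}  (theory: {BETA / (ALPHA + BETA):.3f})")
```

The fitted exponent matches β/(α+β) ≈ 0.452, confirming the closed-form allocation in the summary.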

Exercises

ExerciseCore

Problem

You have a compute budget of C = 10^23 FLOPs. Using the Chinchilla-optimal ratio of 20 tokens per parameter and the C ≈ 6ND rule, what are the optimal model size N* and data size D*?

ExerciseAdvanced

Problem

Suppose the loss function is L(N, D) = E + A/N^α + B/D^β with E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28. Subject to the constraint C = 6ND, use Lagrange multipliers to show that the optimal allocation satisfies N* ∝ C^a and find a in terms of α and β.

ExerciseAdvanced

Problem

Suppose you have U_D = 200B unique tokens and a compute budget of C = 6 × 10^23 FLOPs. The fresh-data Chinchilla recipe asks for D* = 20N*. Using the Muennighoff data-constrained scaling law with R_D* = 15, qualitatively describe how the optimal N shifts when you must reuse data, and compute the rough number of epochs you would run if you set N to the fresh-data Chinchilla value but had only 200B unique tokens.

ExerciseResearch

Problem

The emergent abilities debate centers on whether discontinuities in benchmark performance are real or artifacts of the evaluation metric. Using the Schaeffer-Miranda-Koyejo (1 − cN^(−α))^L model, derive the location of the apparent inflection point in N as a function of task length L. Then design a controlled experiment that distinguishes metric-induced emergence from a genuine capability threshold.


Frequently Asked Questions

What does Chinchilla-optimal mean?
Hoffmann et al. (2022) ran a finer-grained scaling sweep than Kaplan et al. (2020) and found compute-optimal training requires roughly 20 tokens per parameter. Pre-Chinchilla flagship models (GPT-3 included) were drastically undertrained: at the same compute, halving parameters and doubling tokens yields lower loss.
Why did Kaplan and Chinchilla disagree?
Kaplan's experimental design fixed a learning-rate schedule that did not scale appropriately with training horizon, biasing the conclusion toward parameter scaling. Chinchilla varied both N and D simultaneously with proper schedule scaling and found the loss surface had a different shape. The disagreement was experimental design, not a deeper theoretical conflict.
Are emergent abilities real?
Schaeffer et al. (2023) showed many 'emergent' phase transitions disappear under smooth metrics: nonlinear scoring (exact-match accuracy on a long answer) created the appearance of discontinuity, while smooth metrics (per-token log-prob) showed continuous scaling. Some behaviors (chain-of-thought helping at all, code generation past a quality threshold) still appear meaningfully discontinuous to users. The phenomenon is partly a measurement artifact, partly real.
What changes when training data is constrained?
Muennighoff et al. (2023): training on repeated tokens has diminishing but real returns. Up to roughly 4-5 epochs of repetition matches unique-token training; beyond that, returns flatten but stay positive for several more epochs. The 'data wall' as a hard cliff is too pessimistic; the wall is a gradient with non-trivial slope.
Do scaling laws apply to MoE models?
Yes, with extensions. Active parameters drive loss similarly to dense scaling; total parameters add a separate axis with a smaller exponent (sparser models capture less of the dense return). Krajewski et al. (2024), DeepSeekMoE, and the DeepSeek-V3 technical report extend the dense Chinchilla picture to sparse MoE regimes with explicit active-parameter and total-parameter tradeoffs.

References

Core neural scaling:

  • Kaplan et al., Scaling Laws for Neural Language Models (2020). arXiv:2001.08361
  • Hoffmann et al., Training Compute-Optimal Large Language Models (Chinchilla, 2022). arXiv:2203.15556
  • Henighan et al., Scaling Laws for Autoregressive Generative Modeling (2020). arXiv:2010.14701

Data limits and repeated data:

  • Muennighoff et al., Scaling Data-Constrained Language Models (NeurIPS 2023). arXiv:2305.16264
  • Villalobos et al., Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning (2022). arXiv:2211.04325
  • Hernandez et al., Scaling Laws and Interpretability of Learning from Repeated Data (2022). arXiv:2205.10487

Emergence and evaluation:

  • Wei et al., Emergent Abilities of Large Language Models (TMLR 2022). arXiv:2206.07682
  • Schaeffer, Miranda, Koyejo, Are Emergent Abilities of Large Language Models a Mirage? (NeurIPS 2023). arXiv:2304.15004
  • Srivastava et al., Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (2022). arXiv:2206.04615

MoE, hyperparameter transfer, and test-time compute:

  • Krajewski et al., Scaling Laws for Fine-Grained Mixture of Experts (2024). arXiv:2402.07871
  • Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models (2024). arXiv:2401.06066
  • DeepSeek-AI, DeepSeek-V3 Technical Report (2024). arXiv:2412.19437
  • Yang et al., Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP, 2022). arXiv:2203.03466
  • Snell et al., Scaling LLM Test-Time Compute Optimally (2024). arXiv:2408.03314
  • Brown et al., Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024). arXiv:2407.21787


Last reviewed: April 26, 2026
