

Numerical Stability and Conditioning

Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.

Core · Tier 1 · Stable · Core spine · ~45 min

Why This Matters

Every computation in ML runs on finite-precision hardware. Gradients are computed in float32 or float16. Matrix multiplications accumulate rounding errors. Softmax overflows without careful implementation. Loss functions produce NaN when the logarithm receives a zero.

The theorems you learn in analysis assume exact real arithmetic. Numerical stability theory tells you what happens when those assumptions fail. A numerically unstable algorithm can produce garbage output even when the underlying mathematical problem is well-posed. A poorly conditioned problem amplifies input perturbations regardless of the algorithm.

Understanding the distinction between conditioning (a property of the problem) and stability (a property of the algorithm) is required for debugging training failures, choosing implementations, and understanding why tricks like log-sum-exp, BatchNorm, residual connections, and attention scaling exist.

theorem visual

Conditioning versus algorithmic stability

A bad answer can come from a sensitive problem, an unstable algorithm, or both. The fix depends on which layer is failing.

problem layer

Conditioning asks how much the exact map ff amplifies small input perturbations.

algorithm layer

Backward stability asks whether the computed y^\hat{y} is the exact answer f(x^)f(\hat{x}) for a nearby input x^\hat{x}.

failure mode

Catastrophic cancellation can make a naive formula return zero when an algebraically equivalent rewrite would return the correct small value.

Prerequisites

This page assumes familiarity with floating-point arithmetic (IEEE 754, machine epsilon, rounding) and matrix operations (norms, eigenvalues, singular values).

Core Definitions

Definition

Forward Error

The forward error of a computation is the difference between the computed result y^\hat{y} and the true result yy:

forward error=y^y\text{forward error} = \|\hat{y} - y\|

Relative forward error: y^y/y\|\hat{y} - y\| / \|y\|.

Definition

Backward Error

The backward error of a computation asks: for what perturbed input x^\hat{x} does the exact function ff produce the computed output y^\hat{y}? That is, find the smallest Δx\Delta x such that:

y^=f(x+Δx)\hat{y} = f(x + \Delta x)

The backward error is Δx/x\|\Delta x\| / \|x\|. A small backward error means the computed answer is the exact answer to a nearby problem.

Definition

Condition Number

The condition number of a problem measures how sensitive the output is to small perturbations in the input. For a differentiable function ff at input xx:

κ=supΔxϵf(x+Δx)f(x)/f(x)Δx/x\kappa = \sup_{\|\Delta x\| \leq \epsilon} \frac{\|f(x + \Delta x) - f(x)\| / \|f(x)\|}{\|\Delta x\| / \|x\|}

For a matrix AA, the condition number with respect to solving Ax=bAx = b is:

κ(A)=AA1=σmax(A)σmin(A)\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where σmax\sigma_{\max} and σmin\sigma_{\min} are the largest and smallest singular values. A large condition number means the problem is ill-conditioned: small input changes produce large output changes.
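For a 2×2 matrix this can be computed by hand: the singular values are the square roots of the eigenvalues of ATAA^T A, which follow from its trace and determinant. A minimal sketch in pure Python (the helper name is ours):

```python
import math

def cond_2x2(a, b, c, d):
    """kappa_2 = sigma_max / sigma_min for the matrix [[a, b], [c, d]].

    Singular values are square roots of the eigenvalues of A^T A; for a
    2x2 symmetric matrix those come from the mean and discriminant.
    """
    p = a * a + c * c      # (A^T A)_11
    q = a * b + c * d      # (A^T A)_12 = (A^T A)_21
    r = b * b + d * d      # (A^T A)_22
    mean = (p + r) / 2
    disc = math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt((mean + disc) / (mean - disc))

kappa = cond_2x2(1, 2, 3, 4)   # about 14.93 for this matrix
```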

Definition

Numerical Stability

"Numerical stability" is an umbrella term for several distinct, increasingly demanding conditions; modern usage (Higham, Accuracy and Stability of Numerical Algorithms, Ch. 1) distinguishes:

  • Forward stable: the computed y^\hat y is close to the true y=f(x)y = f(x), with relative forward error of order κ(x)ϵmach\kappa(x) \cdot \epsilon_{\text{mach}} — i.e. as accurate as the conditioning permits.
  • Mixed forward-backward stable: y^=f(x^)+Δy\hat y = f(\hat x) + \Delta y with both x^x/x\|\hat x - x\|/\|x\| and Δy/y\|\Delta y\|/\|y\| of order ϵmach\epsilon_{\text{mach}}.
  • Backward stable: y^=f(x^)\hat y = f(\hat x) exactly for some x^\hat x with x^x/x=O(ϵmach)\|\hat x - x\|/\|x\| = O(\epsilon_{\text{mach}}) — the computed result is the exact answer to a nearby problem.

y^=f(x+Δx),Δx/x=O(ϵmach)\hat{y} = f(x + \Delta x), \quad \|\Delta x\| / \|x\| = O(\epsilon_{\text{mach}})

Backward stability ⟹ mixed forward-backward stability ⟹ forward stability (relative to conditioning), but the converses fail in general. Backward stability is the gold standard because (a) it implies the others and (b) the actual error is then bounded purely by the condition number κ\kappa via forward errorκbackward error\text{forward error} \leq \kappa \cdot \text{backward error}. Loose use of "stable" without qualification typically means backward stable in numerical linear algebra and forward stable in optimization.

The Central Relationship

The forward error is bounded by the product of condition number and backward error:

y^yyκΔxx\frac{\|\hat{y} - y\|}{\|y\|} \leq \kappa \cdot \frac{\|\Delta x\|}{\|x\|}

This means:

  • Well-conditioned + backward stable = accurate result. Small backward error, small amplification.
  • Ill-conditioned + backward stable = large forward error, but the best you can do. The problem itself is sensitive; no algorithm can avoid the amplification.
  • Well-conditioned + unstable = avoidable inaccuracy. Switch to a better algorithm.
  • Ill-conditioned + unstable = disaster. Both the problem and the algorithm contribute to error.

Main Theorems

Theorem

Backward Stability of Gaussian Elimination with Partial Pivoting

Statement

Gaussian elimination with partial pivoting computes the solution x^\hat{x} of Ax=bAx = b such that:

(A+ΔA)x^=b,ΔAg(n)uA(A + \Delta A)\hat{x} = b, \quad \|\Delta A\| \leq g(n) \cdot u \cdot \|A\|

where uu is the unit roundoff and g(n)g(n) is a growth factor. For partial pivoting, g(n)2n1g(n) \leq 2^{n-1} in the worst case. This exponential bound is pessimistic for most practical matrices, but the theorem's honest guarantee is still growth-factor dependent: pivoting makes Gaussian elimination reliable in ordinary use, not magically immune to pathological examples.

The forward error satisfies:

x^xxκ(A)g(n)u\frac{\|\hat{x} - x\|}{\|x\|} \leq \kappa(A) \cdot g(n) \cdot u

Intuition

Gaussian elimination with partial pivoting is backward stable: the computed solution x^\hat{x} is the exact solution of a slightly perturbed system (A+ΔA)x^=b(A + \Delta A)\hat{x} = b. The perturbation is small (on the order of machine epsilon times the matrix norm). The forward error is this backward error amplified by the condition number κ(A)\kappa(A). If AA is well-conditioned, the result is accurate. If AA is ill-conditioned, the error can be large, but no direct method can do better.

Proof Sketch

Track the rounding errors through the elimination process. Each elementary row operation introduces a relative error of O(u)O(u). Partial pivoting ensures that the multipliers are bounded by 1, preventing individual steps from amplifying errors excessively. The accumulated error after O(n3)O(n^3) operations is bounded by g(n)uAg(n) u \|A\|, where the growth factor g(n)g(n) accounts for the worst-case accumulation through the pivot sequence.

The full proof appears in Higham, Accuracy and Stability of Numerical Algorithms, Chapter 9.

Why It Matters

This theorem explains why Gaussian elimination works well in practice despite theoretical worst-case growth factors of 2n12^{n-1}. It also explains why condition number matters: even with a perfect algorithm, an ill-conditioned matrix produces large forward error. In ML, condition numbers arise everywhere: the Hessian condition number determines gradient descent convergence speed, the condition number of the data matrix affects linear regression stability, and the condition number of attention logits affects softmax precision.

Failure Mode

Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. The growth factor can be exponential. Example: the matrix (ϵ111)\begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix} for small ϵ\epsilon produces enormous multipliers (1/ϵ1/\epsilon) without pivoting, leading to complete loss of precision.

Partial pivoting fails in rare pathological cases (Wilkinson matrices) where g(n)=2n1g(n) = 2^{n-1}. Complete pivoting reduces g(n)g(n) further but is more expensive.
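The 2×2 failure above can be stepped through directly in float64. A sketch (ϵ\epsilon chosen small enough that the matrix entries of size 1 are absorbed by rounding):

```python
# Solve [[eps, 1], [1, 1]] x = [1, 2] by elimination, with and without pivoting.
eps = 1e-17

# No pivoting: the tiny pivot eps gives a huge multiplier m = 1/eps.
m = 1.0 / eps
a22 = 1.0 - m * 1.0            # 1 - 1e17 rounds to -1e17: the "1" is lost
b2 = 2.0 - m * 1.0             # 2 - 1e17 rounds to -1e17: the "2" is lost
x2 = b2 / a22                  # exactly 1.0
x1_nopivot = (1.0 - x2) / eps  # 0.0, but the true x1 is about 1

# Partial pivoting: swap rows first, so the multiplier is eps (bounded by 1).
x2_pivot = (1.0 - eps * 2.0) / (1.0 - eps)  # ~1.0
x1_pivot = 2.0 - x2_pivot                   # ~1.0, full accuracy
```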

Proposition

Catastrophic Cancellation

Statement

Let aa and bb be floating-point numbers with aba \approx b. The relative error in computing c^=fl(ab)\hat{c} = \text{fl}(a - b) can be as large as:

c^(ab)ab2umax(a,b)ab\frac{|\hat{c} - (a - b)|}{|a - b|} \leq \frac{2u \cdot \max(|a|, |b|)}{|a - b|}

When aba|a - b| \ll |a|, this relative error is much larger than uu. In the extreme, when aa and bb agree in kk leading digits, the result aba - b has only pkp - k significant digits (where pp is the total precision).

Intuition

Subtraction of nearly equal numbers cancels the leading significant digits, leaving only the noisy trailing digits. The absolute error in c^\hat{c} is no worse than before (bounded by umax(a,b)u \cdot \max(|a|, |b|)), but the result ab|a - b| is tiny, so the relative error explodes.

Example: in 7-digit arithmetic, 1.2345671.234566=0.0000011.234567 - 1.234566 = 0.000001. But if the last digits were rounded, the result might be 0.0000020.000002 or 0.0000000.000000, giving 100% relative error.
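The same effect is easy to trigger in float64. A small illustration with (1cosx)/x2(1 - \cos x)/x^2, which should approach 1/21/2 as x0x \to 0:

```python
import math

x = 1e-8
# Naive form: cos(1e-8) rounds to exactly 1.0 in float64, so the
# subtraction cancels every significant digit and returns 0.
naive = (1 - math.cos(x)) / x**2

# Stable rewrite via the identity 1 - cos(x) = 2*sin(x/2)**2: no subtraction.
stable = 2 * (math.sin(x / 2) / x) ** 2

print(naive, stable)   # 0.0 vs ~0.5
```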

Proof Sketch

The absolute error in aba - b is bounded by u(a+b)2umax(a,b)u(|a| + |b|) \leq 2u \max(|a|, |b|) (from the rounding errors already present in aa and bb). Dividing by ab|a - b| gives the relative error bound. When ab|a - b| is small compared to a|a|, the ratio is large.

Why It Matters

Catastrophic cancellation, along with its close relatives overflow and underflow, explains many numerical failures in ML:

  • Computing Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2 fails when the variance is small relative to the mean, because both terms are large and nearly equal. The stable alternative: compute the variance using the two-pass algorithm or Welford's online algorithm.
  • Softmax: exp(xi)/jexp(xj)\exp(x_i) / \sum_j \exp(x_j) overflows when xix_i is large. The fix: subtract maxjxj\max_j x_j from all entries. This is numerically equivalent but avoids overflow.
  • LogSumExp: logjexp(xj)\log \sum_j \exp(x_j) requires the same trick to avoid overflow.

Backward vs Forward Error Analysis

The Wilkinson framework, developed by J.H. Wilkinson in the 1960s, provides a systematic way to analyze numerical algorithms. The key insight is to ask not "how wrong is the output?" but "for what exact problem is the output the correct answer?"

Forward error analysis tracks how rounding errors accumulate through each operation and bounds the total deviation of the computed result from the true result. It is direct but tends to be pessimistic: errors are assumed to compound in the worst direction at every step, which rarely happens in practice.

Backward error analysis takes the computed output y^\hat{y} and asks: what is the smallest perturbation δx\delta x to the input xx such that f(x+δx)=y^f(x + \delta x) = \hat{y} exactly? The backward error is δx/x\|\delta x\| / \|x\|. An algorithm is backward stable if and only if this ratio is O(ϵmach)O(\epsilon_{\text{mach}}) for all inputs.

The central theorem connecting the two:

forward errorκbackward error\text{forward error} \leq \kappa \cdot \text{backward error}

where κ\kappa is the condition number of the problem. This decomposes error into two orthogonal contributions:

  • Problem sensitivity (κ\kappa): intrinsic, cannot be improved by changing the algorithm.
  • Algorithm quality (backward error): can be improved by choosing a better algorithm.

A backward stable algorithm achieves backward error of O(ϵmach)O(\epsilon_{\text{mach}}). It "does as well as the problem allows"; the only remaining error is from conditioning.

Wilkinson's specific contributions include the backward-error framework for Gaussian elimination and the growth-factor analysis that explains why partial pivoting is usually reliable despite rare worst cases. The growth factor g(n)=maxkA(k)/Ag(n) = \max_{k} \|A^{(k)}\|_\infty / \|A\|_\infty (where A(k)A^{(k)} is the matrix after kk elimination steps) is the key quantity bounding backward error.

In floating-point arithmetic, every elementary operation fl(ab)=(ab)(1+δ)\text{fl}(a \circ b) = (a \circ b)(1 + \delta) with δϵmach|\delta| \leq \epsilon_{\text{mach}}. A backward error analysis chains these single-step bounds through the entire computation. For matrix-vector multiplication y=Axy = Ax with an n×nn \times n matrix, the computed y^\hat{y} can be written as

y^=(A+ΔA)x,ΔAijnϵmachAij.\hat{y}=(A+\Delta A)x,\qquad |\Delta A_{ij}| \lesssim n\epsilon_{\text{mach}}|A_{ij}|.

So the matrix-vector product is backward stable with respect to the matrix entries. The relative forward error is controlled by the condition number of the multiplication problem, roughly Ax/Ax\|A\|\,\|x\|/\|Ax\|, not by the linear-solve condition number AA1\|A\|\|A^{-1}\| unless you are actually solving Ax=bAx=b.

The practical implication for ML: when a training run produces NaN or wildly large gradients, the question is whether the problem is conditioning (the Hessian eigenvalue ratio is extreme) or stability (the algorithm is amplifying errors beyond what conditioning requires). Backward error analysis distinguishes the two.

Key Stability Tricks in ML

Log-Sum-Exp

The naive computation of logj=1nexp(xj)\log \sum_{j=1}^n \exp(x_j) overflows when any xj>709x_j > 709 (for float64) or xj>88x_j > 88 (for float32). The stable version:

LogSumExp(x)=m+logj=1nexp(xjm),m=maxjxj\text{LogSumExp}(x) = m + \log \sum_{j=1}^n \exp(x_j - m), \quad m = \max_j x_j

Subtracting mm ensures all exponents are 0\leq 0, so exp(xjm)(0,1]\exp(x_j - m) \in (0, 1]. This is mathematically identical but numerically safe.
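A minimal implementation of the shifted form in pure Python (the naive math.exp(1000.0) would raise OverflowError):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x_j))): factor the max out of the sum first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

val = logsumexp([1000.0, 1000.0])   # exact answer is 1000 + log(2)
```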

Softmax Stability

Similarly, softmax(x)i=exp(xi)/jexp(xj)\text{softmax}(x)_i = \exp(x_i) / \sum_j \exp(x_j) is computed as:

softmax(x)i=exp(xim)/jexp(xjm)\text{softmax}(x)_i = \exp(x_i - m) / \sum_j \exp(x_j - m)

This prevents overflow in the numerator and denominator.
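The same max-shift in code, on logits where the naive version overflows immediately:

```python
import math

def softmax(xs):
    """Stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])   # naive exp(1000) overflows
```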

Variance Computation

The textbook formula Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2 suffers from catastrophic cancellation when the mean is large and the variance is small. Welford's algorithm computes the variance in a single pass with O(1)O(1) memory and avoids this:

Update running mean xˉn\bar{x}_n and sum of squared differences MnM_n:

  • δ=xnxˉn1\delta = x_n - \bar{x}_{n-1}
  • xˉn=xˉn1+δ/n\bar{x}_n = \bar{x}_{n-1} + \delta / n
  • Mn=Mn1+δ(xnxˉn)M_n = M_{n-1} + \delta (x_n - \bar{x}_n)
  • Var=Mn/(n1)\text{Var} = M_n / (n - 1)
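The updates above in code, side by side with the cancelling textbook formula. With a mean of 10810^8 and a variance of order 10210^{-2}, the one-pass sum loses every digit of the variance in float64:

```python
def welford_variance(xs):
    """Single-pass, cancellation-free sample variance (Welford's updates)."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)

data = [1e8 + d for d in (0.0, 0.1, 0.2, 0.3)]   # huge mean, tiny variance
n = len(data)
naive = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)
stable = welford_variance(data)   # ~0.0167; naive is wildly wrong
```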

Attention Scaling

In the transformer, attention logits are QKT/dkQK^T / \sqrt{d_k} where dkd_k is the head dimension. For q,kRdkq, k \in \mathbb{R}^{d_k} with i.i.d. unit-variance entries, the dot product qkq \cdot k has variance O(dk)O(d_k) and therefore typical magnitude O(dk)O(\sqrt{d_k}). Without the dk\sqrt{d_k} scaling the logits would grow as dk\sqrt{d_k}, pushing the softmax into its saturated regime where gradients vanish. Dividing by dk\sqrt{d_k} restores unit variance and keeps the logits in a range where softmax has meaningful gradients.
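The variance claim is easy to check empirically. A Monte Carlo sketch in pure Python (fixed seed; the bounds are loose because the estimates are noisy):

```python
import random
import statistics

random.seed(0)
d_k = 128
dots = []
for _ in range(1000):
    q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    dots.append(sum(qi * ki for qi, ki in zip(q, k)))

var_unscaled = statistics.pvariance(dots)                           # ~d_k
var_scaled = statistics.pvariance([d / d_k ** 0.5 for d in dots])   # ~1
```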

Mixed-Precision Training

Modern ML training uses float16 (or bfloat16) for the forward and backward passes and float32 for weight updates. The float32 master weights keep small updates from being lost during accumulation, while loss scaling addresses underflow: float16 gradients flush to zero below the smallest subnormal, 2246×1082^{-24} \approx 6 \times 10^{-8}. Loss scaling (multiplying the loss by a large constant, computing gradients, then dividing the gradients by the same constant) shifts the gradient distribution into the representable range.
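The underflow and the loss-scaling fix can be reproduced with Python's struct module, which supports IEEE 754 half precision via the 'e' format (the scale value here is illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                     # below the smallest fp16 subnormal (~6e-8)
assert to_fp16(grad) == 0.0     # flushes to zero: the gradient is lost

scale = 2 ** 12                 # illustrative loss scale
scaled = to_fp16(grad * scale)  # 4.096e-5 is representable in fp16
recovered = scaled / scale      # unscale in float64: ~1e-8 again
```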

Residual Connections and BatchNorm

Residual connections (y=x+f(x)y = x + f(x)) add a skip path that preserves signal magnitude. Without them, signals in deep networks can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients). BatchNorm normalizes intermediate activations to zero mean and unit variance. The original paper framed this as reducing internal covariate shift; later work argues the main benefit is often smoother optimization geometry and better gradient conditioning. Either way, the practical lesson is scale control: deep nets train better when activation and gradient magnitudes stay in a numerically useful range.

Stability Debugger for ML

Symptom | Likely numerical mechanism | Better response
NaNs in softmax or cross-entropy | exponent overflow or log(0)\log(0) | Use fused log_softmax / cross-entropy implementations
Negative or zero variance estimate | cancellation in E[X2](E[X])2E[X^2]-(E[X])^2 | Use two-pass variance or Welford updates
Tiny residual but bad solution | ill-conditioned linear system | Inspect κ(A)\kappa(A), regularize, precondition, or rescale
fp16 gradients become zero | underflow | Use loss scaling, bfloat16, or float32 accumulation
Loss spikes after a learning-rate change | unstable discretization / step size too large | Reduce step size, add clipping, check normalization

Common Confusions

Watch Out

Conditioning is a property of the problem, not the algorithm

A matrix with condition number 101510^{15} produces large forward errors under ANY algorithm, not just unstable ones. No amount of algorithmic cleverness can compensate for ill-conditioning. The only solution is to reformulate the problem (regularization, preconditioning) or accept the limited precision.

Watch Out

Small residual does not mean accurate solution

For the linear system Ax=bAx = b, the residual r=bAx^r = b - A\hat{x} can be small even when x^\hat{x} is far from the true xx, if AA is ill-conditioned. The bound is xx^/xκ(A)r/b\|x - \hat{x}\| / \|x\| \leq \kappa(A) \cdot \|r\| / \|b\|. With κ(A)=1010\kappa(A) = 10^{10} and r/b=1016\|r\|/\|b\| = 10^{-16} (machine epsilon), the solution error can be 10610^{-6}, which is only 6 digits of accuracy despite a tiny residual.
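A concrete 2×2 instance with hand-rolled arithmetic (the perturbation direction [1, -1] is approximately the near-null direction of this matrix):

```python
A = [[1.0, 1.0], [1.0, 1.001]]   # kappa_2 is roughly 4e3
x_true = [1.0, 1.0]
b = [2.0, 2.001]                 # so A @ x_true = b exactly

# Perturb the solution along the near-null direction [1, -1].
x_hat = [1.0 + 1e-3, 1.0 - 1e-3]

residual = [b[i] - (A[i][0] * x_hat[0] + A[i][1] * x_hat[1]) for i in range(2)]
# residual is ~[0, 1e-6]: tiny, yet x_hat is off by 1e-3 in each component.
err = max(abs(x_hat[0] - 1.0), abs(x_hat[1] - 1.0))
```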

Watch Out

float16 is not just float32 with less precision

float16 has a much smaller dynamic range (roughly 6×1086 \times 10^{-8}, the smallest subnormal, up to 6550465504) compared to float32 (roughly 103810^{-38} to 3×10383 \times 10^{38}). Underflow and overflow are not just precision issues; they produce zeros and infinities. bfloat16 has the same dynamic range as float32 (same exponent bits) but less precision: 8 vs 24 total significand bits (counting the implicit leading 1), or equivalently 7 vs 23 explicit stored mantissa bits. The choice between float16 and bfloat16 depends on whether precision or dynamic range is more important for the application.

Watch Out

Numerical stability and algorithmic stability are different concepts

Numerical stability concerns rounding errors in finite-precision arithmetic. Algorithmic stability (in the ML sense) concerns how much the learned model changes when one training example is added or removed. Both use the word "stability" but they are different properties. A learning algorithm can be algorithmically stable (small model change per sample) while being numerically unstable (rounding errors corrupt the gradient), or vice versa.

Summary

  • Forward error = condition number times backward error. This decomposition separates the problem's sensitivity from the algorithm's accuracy.
  • Gaussian elimination with partial pivoting is backward stable; the forward error is controlled by the condition number.
  • Catastrophic cancellation occurs when subtracting nearly equal numbers, destroying relative precision.
  • The log-sum-exp trick, softmax centering, and Welford variance are standard defenses against numerical instability.
  • Attention scaling by 1/dk1/\sqrt{d_k} prevents softmax saturation.
  • Mixed-precision training uses float16 for speed and float32 for accumulation to avoid gradient underflow.
  • Conditioning is a problem property; stability is an algorithm property. Know which one is causing your trouble.

Exercises

ExerciseCore

Problem

The matrix A=(1111.0001)A = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix} has trace 2.00012.0001 and determinant 0.00010.0001, giving eigenvalues approximately λ12.00005\lambda_1 \approx 2.00005 and λ20.00005\lambda_2 \approx 0.00005. Compute its condition number κ2(A)=σmax/σmin\kappa_2(A) = \sigma_{\max}/\sigma_{\min}. If you solve Ax=bAx = b with a backward stable algorithm in float64 (u1016u \approx 10^{-16}), what is the approximate worst-case relative error in xx?

ExerciseCore

Problem

Show that computing f(x)=x+1xf(x) = \sqrt{x+1} - \sqrt{x} for large xx suffers from catastrophic cancellation. Derive a mathematically equivalent form that is numerically stable.

ExerciseAdvanced

Problem

In mixed-precision training, gradients are computed in float16 and accumulated in float32. Explain why loss scaling is necessary. If the typical gradient magnitude is 10510^{-5} and float16 flushes values below its smallest positive subnormal 2246×1082^{-24} \approx 6 \times 10^{-8} to zero, what loss scale factor prevents underflow?

ExerciseAdvanced

Problem

The two-pass variance algorithm computes xˉ=1nxi\bar{x} = \frac{1}{n}\sum x_i first, then s2=1n1(xixˉ)2s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2. The one-pass formula s2=1n1(xi2nxˉ2)s^2 = \frac{1}{n-1}(\sum x_i^2 - n\bar{x}^2) suffers from catastrophic cancellation. Construct a dataset of 4 numbers where the one-pass formula gives a negative variance in 4-digit decimal arithmetic with rounding after each operation, but the two-pass formula gives a positive result.

References

Canonical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002), Chapters 1-3 (backward error, condition numbers) and Chapter 9 (Gaussian elimination stability)
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-16 (conditioning, backward stability, QR)
  • Wilkinson, Rounding Errors in Algebraic Processes (1963), Chapters 1-2 — the original backward error framework

ML Connections:

  • Micikevicius et al., "Mixed Precision Training" (ICLR 2018), Section 3 (loss scaling, float16 underflow)
  • Vaswani et al., "Attention Is All You Need" (2017), Section 3.2.1 (scaling by dk\sqrt{d_k})
  • Ioffe & Szegedy, "Batch Normalization" (ICML 2015), Section 3 (normalization as numerical stabilization)
  • Santurkar et al., "How Does Batch Normalization Help Optimization?" (NeurIPS 2018), Section 3 (optimization landscape smoothing)

Accessible:

  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991), Sections 1-3 (rounding, cancellation, IEEE 754)


Last reviewed: April 26, 2026
