Foundations
Numerical Stability and Conditioning
Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.
Why This Matters
Every computation in ML runs on finite-precision hardware. Gradients are computed in float32 or float16. Matrix multiplications accumulate rounding errors. Softmax overflows without careful implementation. Cross-entropy losses produce infinities or NaNs when the logarithm receives a zero.
The theorems you learn in analysis assume exact real arithmetic. Numerical stability theory tells you what happens when those assumptions fail. A numerically unstable algorithm can produce garbage output even when the underlying mathematical problem is well-posed. A poorly conditioned problem amplifies input perturbations regardless of the algorithm.
Understanding the distinction between conditioning (a property of the problem) and stability (a property of the algorithm) is required for debugging training failures, choosing implementations, and understanding why tricks like log-sum-exp, BatchNorm, residual connections, and attention scaling exist.
Prerequisites
This page assumes familiarity with floating-point arithmetic (IEEE 754, machine epsilon, rounding) and matrix operations (norms, eigenvalues, singular values).
Core Definitions
Forward Error
The forward error of a computation $\hat{y} = \hat{f}(x)$ is the difference between the computed result $\hat{y}$ and the true result $y = f(x)$:

$$\Delta y = \hat{y} - y$$

Relative forward error: $\dfrac{\|\hat{y} - y\|}{\|y\|}$.
Backward Error
The backward error of a computation $\hat{y} = \hat{f}(x)$ asks: for what perturbed input $x + \Delta x$ does the exact algorithm produce the computed output $\hat{y}$? That is, find the smallest $\Delta x$ such that:

$$f(x + \Delta x) = \hat{y}$$

The backward error is $\dfrac{\|\Delta x\|}{\|x\|}$. A small backward error means the computed answer is the exact answer to a nearby problem.
Condition Number
The condition number of a problem measures how sensitive the output is to small perturbations in the input. For a differentiable function $f$ at input $x$, the relative condition number is:

$$\kappa(x) = \frac{\|J_f(x)\|\,\|x\|}{\|f(x)\|}$$

where $J_f$ is the Jacobian of $f$.
For a matrix $A$, the condition number with respect to solving $Ax = b$ is:

$$\kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} \quad \text{(in the 2-norm)}$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. A large condition number means the problem is ill-conditioned: small input changes produce large output changes.
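As a quick check of this definition, the 2-norm condition number can be computed directly from the singular values. A sketch using NumPy; the matrix here is an arbitrary nearly singular example:

```python
import numpy as np

# A nearly singular 2x2 matrix: the rows are almost identical.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-8]])

# kappa_2(A) = sigma_max / sigma_min, from the singular values.
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa = sigma[0] / sigma[-1]

# np.linalg.cond uses the 2-norm by default, so it must agree.
assert np.isclose(kappa, np.linalg.cond(A))
print(f"kappa_2(A) = {kappa:.2e}")           # very large: A is ill-conditioned
```

Perturbing the right-hand side of $Ax = b$ by one part in $10^8$ can change the solution of this system by order one, which is exactly what $\kappa \approx 10^8$ predicts.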
Numerical Stability
"Numerical stability" is an umbrella term for several distinct, increasingly demanding conditions; modern usage (Higham, Accuracy and Stability of Numerical Algorithms, Ch. 1) distinguishes:
- Forward stable: the computed $\hat{y}$ is close to the true $y$, with relative forward error of order $\kappa(x)\,u$ — i.e. as accurate as the conditioning permits.
- Mixed forward-backward stable: $\hat{y} + \Delta y = f(x + \Delta x)$ with both $\|\Delta y\|/\|y\|$ and $\|\Delta x\|/\|x\|$ of order $u$.
- Backward stable: $\hat{y} = f(x + \Delta x)$ exactly for some $\Delta x$ with $\|\Delta x\|/\|x\| = O(u)$ — the computed result is the exact answer to a nearby problem.
Backward stability ⟹ mixed forward-backward stability ⟹ forward stability (relative to conditioning), but the converses fail in general. Backward stability is the gold standard because (a) it implies the others and (b) the actual error is then bounded purely by the condition number via $\text{forward error} \lesssim \kappa(x) \cdot \text{backward error}$. Loose use of "stable" without qualification typically means backward stable in numerical linear algebra and forward stable in optimization.
The Central Relationship
The forward error is bounded by the product of condition number and backward error:

$$\frac{\|\hat{y} - y\|}{\|y\|} \lesssim \kappa(x) \cdot \frac{\|\Delta x\|}{\|x\|}$$
This means:
- Well-conditioned + backward stable = accurate result. Small backward error, small amplification.
- Ill-conditioned + backward stable = large forward error, but the best you can do. The problem itself is sensitive; no algorithm can avoid the amplification.
- Well-conditioned + unstable = avoidable inaccuracy. Switch to a better algorithm.
- Ill-conditioned + unstable = disaster. Both the problem and the algorithm contribute to error.
Main Theorems
Backward Stability of Gaussian Elimination with Partial Pivoting
Statement
Gaussian elimination with partial pivoting computes a solution $\hat{x}$ of $Ax = b$ satisfying:

$$(A + \Delta A)\hat{x} = b, \qquad \frac{\|\Delta A\|_\infty}{\|A\|_\infty} \le c_n\, \rho_n\, u$$

where $u$ is the unit roundoff, $c_n$ is a low-degree polynomial in $n$, and $\rho_n$ is a growth factor. For partial pivoting, $\rho_n \le 2^{n-1}$ in the worst case. This exponential bound is pessimistic for most practical matrices, but the theorem's honest guarantee is still growth-factor dependent: pivoting makes Gaussian elimination reliable in ordinary use, not magically immune to pathological examples.
The forward error satisfies:

$$\frac{\|\hat{x} - x\|}{\|x\|} \lesssim c_n\, \rho_n\, \kappa(A)\, u$$
Intuition
Gaussian elimination with partial pivoting is backward stable: the computed solution $\hat{x}$ is the exact solution of a slightly perturbed system $(A + \Delta A)\hat{x} = b$. The perturbation is small (on the order of machine epsilon times the matrix norm). The forward error is this backward error amplified by the condition number $\kappa(A)$. If $A$ is well-conditioned, the result is accurate. If $A$ is ill-conditioned, the error can be large, but no direct method can do better.
Proof Sketch
Track the rounding errors through the elimination process. Each elementary row operation introduces a relative error of order $u$. Partial pivoting ensures that the multipliers are bounded by 1, preventing individual steps from amplifying errors excessively. The accumulated error after the $O(n^3)$ operations is bounded by $c_n\, \rho_n\, u\, \|A\|$, where the growth factor $\rho_n$ accounts for the worst-case accumulation through the pivot sequence.
The full proof appears in Higham, Accuracy and Stability of Numerical Algorithms, Chapter 9.
Why It Matters
This theorem explains why Gaussian elimination works well in practice despite theoretical worst-case growth factors of $2^{n-1}$. It also explains why condition number matters: even with a perfect algorithm, an ill-conditioned matrix produces large forward error. In ML, condition numbers arise everywhere: the Hessian condition number determines gradient descent convergence speed, the condition number of the data matrix affects linear regression stability, and the scale of attention logits affects softmax precision.
Failure Mode
Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. The growth factor can be exponential. Example: the matrix $\begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}$ for small $\epsilon$ produces an enormous multiplier ($1/\epsilon$) without pivoting, leading to complete loss of precision.
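The failure is easy to reproduce in float64. A sketch assuming NumPy; `eps` plays the role of the tiny pivot:

```python
import numpy as np

eps = 1e-20  # tiny pivot, far below float64 resolution relative to 1

A = np.array([[eps, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0])
x_true = np.linalg.solve(A, b)   # LAPACK pivots internally: accurate, ~[1, 1]

# Elimination WITHOUT pivoting: the multiplier m = 1/eps is enormous.
m = A[1, 0] / A[0, 0]            # 1e20
u22 = A[1, 1] - m * A[0, 1]      # 1 - 1e20: the '1' is rounded away entirely
y2 = b[1] - m * b[0]             # 2 - 1e20: the '2' is rounded away entirely
x2 = y2 / u22                    # 1.0 (the damage is hidden here)
x1 = (b[0] - 1.0 * x2) / A[0, 0] # 0.0 instead of ~1.0: total loss of x1
x_nopivot = np.array([x1, x2])
```

Swapping the rows first (what partial pivoting does) keeps the multiplier at $\epsilon \le 1$ and recovers full accuracy.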
Partial pivoting fails in rare pathological cases (Wilkinson matrices) where $\rho_n = 2^{n-1}$. Complete pivoting reduces $\rho_n$ further but is more expensive.
Catastrophic Cancellation
Statement
Let $a$ and $b$ be floating-point numbers with $a \approx b$. The relative error in computing $\mathrm{fl}(a - b)$ can be as large as:

$$\frac{u\,(|a| + |b|)}{|a - b|}$$

When $|a - b| \ll |a| + |b|$, this relative error is much larger than $u$. In the extreme, when $a$ and $b$ agree in $k$ leading digits, the result has only about $p - k$ significant digits (where $p$ is the total precision).
Intuition
Subtraction of nearly equal numbers cancels the leading significant digits, leaving only the noisy trailing digits. The absolute error in $a - b$ is no worse than before (bounded by $u(|a| + |b|)$), but the result is tiny, so the relative error explodes.
Example: in 7-digit arithmetic, $1.234567 - 1.234566 = 0.000001$. But if the last digits had been rounded, the result might be $0.000002$ or $0.000000$, giving 100% relative error.
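A deterministic float64 version of the same effect: near $10^{16}$ adjacent doubles are 2 apart, so adding 1 is already a rounding error, and the subtraction then exposes it as a 100% relative error. A minimal sketch:

```python
# Near 1e16, float64 spacing is 2, so 1e16 + 1 rounds back to 1e16.
computed = (1e16 + 1.0) - 1e16   # mathematically equals 1.0
# The addition's rounding error survives; the cancellation reveals it.
assert computed == 0.0

# Reordering the operations avoids the cancellation entirely:
reordered = (1e16 - 1e16) + 1.0
assert reordered == 1.0
```

The absolute error (1.0) was bounded by the ulp at $10^{16}$ all along; it only becomes catastrophic relative to the tiny true difference.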
Proof Sketch
The absolute error in $\mathrm{fl}(a) - \mathrm{fl}(b)$ is bounded by $u(|a| + |b|)$ (from the rounding errors already present in $a$ and $b$). Dividing by $|a - b|$ gives the relative error bound. When $|a - b|$ is small compared to $|a| + |b|$, the ratio is large.
Why It Matters
Catastrophic cancellation explains many numerical failures in ML:
- Variance: computing $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ fails when the variance is small relative to the mean, because both terms are large and nearly equal. The stable alternative: compute the variance using the two-pass algorithm or Welford's online algorithm.
- Softmax: $e^{x_i}$ overflows when $x_i$ is large. The fix: subtract $\max_j x_j$ from all entries. This is mathematically equivalent but avoids overflow.
- LogSumExp: $\log \sum_i e^{x_i}$ requires the same trick to avoid overflow.
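The variance failure in the first bullet is reproducible with any dataset whose mean dwarfs its spread. A sketch in NumPy; the values are arbitrary:

```python
import numpy as np

# Three points with mean ~1e8 and true population variance 2/3.
x = np.array([1e8, 1e8 + 1.0, 1e8 + 2.0])

naive = np.mean(x**2) - np.mean(x)**2     # subtracts two numbers near 1e16
two_pass = np.mean((x - np.mean(x))**2)   # centers first, then squares

# two_pass recovers 2/3; naive is dominated by rounding at the 1e16 scale,
# where adjacent float64 values are 2 apart.
```

The naive result is off by an amount set by the ulp at $10^{16}$, not by the data.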
Backward vs Forward Error Analysis
The Wilkinson framework, developed by J.H. Wilkinson in the 1960s, provides a systematic way to analyze numerical algorithms. The key insight is to ask not "how wrong is the output?" but "for what exact problem is the output the correct answer?"
Forward error analysis tracks how rounding errors accumulate through each operation and bounds the total deviation of the computed result from the true result. It is direct but tends to be pessimistic: errors are assumed to compound in the worst direction at every step, which rarely happens in practice.
Backward error analysis takes the computed output $\hat{y}$ and asks: what is the smallest perturbation $\Delta x$ to the input such that $f(x + \Delta x) = \hat{y}$ exactly? The backward error is $\|\Delta x\| / \|x\|$. An algorithm is backward stable if and only if this ratio is $O(u)$ for all inputs.
The central theorem connecting the two:

$$\text{forward error} \lesssim \kappa \cdot \text{backward error}$$

where $\kappa$ is the condition number of the problem. This decomposes error into two orthogonal contributions:
- Problem sensitivity ($\kappa$): intrinsic, cannot be improved by changing the algorithm.
- Algorithm quality (backward error): can be improved by choosing a better algorithm.
A backward stable algorithm achieves backward error of $O(u)$. It "does as well as the problem allows"; the only remaining error is from conditioning.
Wilkinson's specific contributions include the backward-error framework for Gaussian elimination and the growth-factor analysis that explains why partial pivoting is usually reliable despite rare worst cases. The growth factor $\rho_n = \dfrac{\max_{i,j,k} |a_{ij}^{(k)}|}{\max_{i,j} |a_{ij}|}$ (where $a_{ij}^{(k)}$ are the entries of the matrix after $k$ elimination steps) is the key quantity bounding backward error.
In floating-point arithmetic, every elementary operation satisfies $\mathrm{fl}(x \mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1 + \delta)$ with $|\delta| \le u$. A backward error analysis chains these single-step bounds through the entire computation. For matrix-vector multiplication with an $n \times n$ matrix $A$, the computed $\hat{y} = \mathrm{fl}(Ax)$ can be written as

$$\hat{y} = (A + \Delta A)x, \qquad |\Delta A| \le \gamma_n |A|, \qquad \gamma_n = \frac{nu}{1 - nu}$$

So the matrix-vector product is backward stable with respect to the matrix entries. The relative forward error is controlled by the condition number of the multiplication problem, roughly $\dfrac{\|A\|\,\|x\|}{\|Ax\|}$, not by the linear-solve condition number $\kappa(A)$ unless you are actually solving $Ax = b$.
The practical implication for ML: when a training run produces NaN or wildly large gradients, the question is whether the problem is conditioning (the Hessian eigenvalue ratio is extreme) or stability (the algorithm is amplifying errors beyond what conditioning requires). Backward error analysis distinguishes the two.
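Backward and forward error for a linear solve can be measured directly. A sketch assuming NumPy; the norms mix Frobenius and 2-norm, so constants are loose, but the orders of magnitude are the point:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
x_true = rng.standard_normal(100)
b = A @ x_true

x_hat = np.linalg.solve(A, b)        # LU with partial pivoting (LAPACK)

# Normwise backward error: how much must A be perturbed so that x_hat
# solves the perturbed system exactly?
r = b - A @ x_hat
backward = np.linalg.norm(r) / (np.linalg.norm(A) * np.linalg.norm(x_hat))

forward = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
kappa = np.linalg.cond(A)

# backward is O(machine epsilon); forward ≲ kappa * backward.
```

A random Gaussian matrix is well-conditioned, so both errors come out tiny; swapping in an ill-conditioned matrix keeps `backward` tiny while `forward` grows in proportion to $\kappa$.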
Key Stability Tricks in ML
Log-Sum-Exp
The naive computation of $\mathrm{LSE}(x) = \log \sum_i e^{x_i}$ overflows when any $x_i \gtrsim 709$ (for float64) or $x_i \gtrsim 88$ (for float32). The stable version:

$$\mathrm{LSE}(x) = m + \log \sum_i e^{x_i - m}, \qquad m = \max_i x_i$$

Subtracting $m$ ensures all exponents are $\le 0$, so $e^{x_i - m} \in (0, 1]$. This is mathematically identical but numerically safe.
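A minimal implementation sketch in NumPy; the name `logsumexp` is chosen here by analogy with `scipy.special.logsumexp`:

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so every exponent is <= 0."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over='ignore'):
    naive = np.log(np.sum(np.exp(x)))   # exp(1000) overflows float64 -> inf

stable = logsumexp(x)
# naive is inf; stable ≈ 1002 + log(1 + e^-1 + e^-2) ≈ 1002.408
```

The shift costs one extra pass over the data and changes nothing mathematically, since $e^{-m}$ factors out of the sum and $\log$ turns it back into $+m$.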
Softmax Stability
Similarly, $\mathrm{softmax}(x)_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$ is computed as:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad m = \max_j x_j$$

This prevents overflow in the numerator and denominator.
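The same shift, written as a function (a sketch in NumPy; the shift cancels exactly in the ratio):

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtracting max(x) leaves the ratio unchanged."""
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([800.0, 802.0, 804.0])   # naive exp(804) overflows float64
p = softmax(logits)
# p is finite, sums to 1, and matches the unshifted mathematical definition.
```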
Variance Computation
The textbook formula $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ suffers from catastrophic cancellation when the mean is large and the variance is small. Welford's algorithm computes the variance in a single pass with $O(1)$ memory and avoids this.

Update the running mean $M_k$ and sum of squared differences $S_k$:

$$M_k = M_{k-1} + \frac{x_k - M_{k-1}}{k}, \qquad S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k)$$

with variance $S_n / n$ (population) or $S_n / (n - 1)$ (sample).
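The update above, in code (a sketch; population variance is returned for simplicity):

```python
def welford(xs):
    """Welford's one-pass running mean and population variance."""
    n, mean, s = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n         # M_k = M_{k-1} + (x_k - M_{k-1}) / k
        s += delta * (x - mean)   # S_k = S_{k-1} + (x_k - M_{k-1})(x_k - M_k)
    return mean, s / n

mean, var = welford([1e8, 1e8 + 1.0, 1e8 + 2.0])
# mean ≈ 1e8 + 1 and var ≈ 2/3, despite the huge offset that breaks
# the E[X^2] - (E[X])^2 formula on the same data.
```

The trick is that `delta` is always a difference of nearby quantities of comparable size, so no intermediate ever reaches the $10^{16}$ scale where cancellation destroys the answer.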
Attention Scaling
In the transformer, attention logits are $q^\top k / \sqrt{d_k}$, where $d_k$ is the head dimension. For $q, k$ with i.i.d. unit-variance entries, the dot product $q^\top k$ has variance $d_k$ and therefore typical magnitude $\sqrt{d_k}$. Without the scaling the logits would grow as $\sqrt{d_k}$, pushing the softmax into its saturated regime where gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance and keeps the logits in a range where softmax has meaningful gradients.
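The variance claim is easy to verify empirically. A sketch in NumPy; the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256                               # head dimension
q = rng.standard_normal((10000, d))   # unit-variance "queries"
k = rng.standard_normal((10000, d))   # unit-variance "keys"

logits = np.sum(q * k, axis=1)        # 10000 raw dot products q·k
scaled = logits / np.sqrt(d)

# logits.std() ≈ sqrt(d) = 16, while scaled.std() ≈ 1:
# unscaled logits of this size would saturate the softmax.
```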
Mixed-Precision Training
Modern ML training uses float16 (or bfloat16) for forward and backward passes and float32 for weight updates. The accumulation in float32 prevents gradient underflow: gradients in float16 can underflow to zero when they are smaller than about $6 \times 10^{-8}$ (the smallest positive float16 subnormal, $2^{-24}$). Loss scaling (multiplying the loss by a large constant, computing gradients, then dividing by the same constant) shifts the gradient distribution into the representable range.
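A sketch of the underflow and the rescue in NumPy; the scale 1024 is an arbitrary power of two:

```python
import numpy as np

g = 1e-8                          # a gradient below float16's subnormal floor (~6e-8)
assert np.float16(g) == 0.0       # casting straight to float16 loses it entirely

scale = 1024.0                    # loss scaling: amplify before the float16 cast
g_scaled = np.float16(g * scale)  # 1.024e-5 IS representable in float16
g_back = float(g_scaled) / scale  # unscale in higher precision
# g_back ≈ 1e-8 again, instead of 0
```

Powers of two are preferred for the scale because multiplying by them changes only the exponent, introducing no rounding of their own.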
Residual Connections and BatchNorm
Residual connections () add a skip path that preserves signal magnitude. Without them, signals in deep networks can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients). BatchNorm normalizes intermediate activations to zero mean and unit variance. The original paper framed this as reducing internal covariate shift; later work argues the main benefit is often smoother optimization geometry and better gradient conditioning. Either way, the practical lesson is scale control: deep nets train better when activation and gradient magnitudes stay in a numerically useful range.
Stability Debugger for ML
| Symptom | Likely numerical mechanism | Better response |
|---|---|---|
| NaNs in softmax or cross-entropy | $e^x$ overflow or $\log(0)$ | Use fused log_softmax / cross-entropy implementations |
| Negative or zero variance estimate | cancellation in $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$ | Use two-pass variance or Welford updates |
| Tiny residual but bad solution | ill-conditioned linear system | Inspect $\kappa(A)$, regularize, precondition, or rescale |
| fp16 gradients become zero | underflow | Use loss scaling, bfloat16, or float32 accumulation |
| Loss spikes after a learning-rate change | unstable discretization / step size too large | Reduce step size, add clipping, check normalization |
Common Confusions
Conditioning is a property of the problem, not the algorithm
A matrix with a large condition number produces large forward errors under ANY algorithm, not just unstable ones. No amount of algorithmic cleverness can compensate for ill-conditioning. The only solution is to reformulate the problem (regularization, preconditioning) or accept the limited precision.
Small residual does not mean accurate solution
For the linear system $Ax = b$, the residual $r = b - A\hat{x}$ can be small even when $\hat{x}$ is far from the true $x$, if $A$ is ill-conditioned. The bound is $\dfrac{\|\hat{x} - x\|}{\|x\|} \le \kappa(A)\, \dfrac{\|r\|}{\|A\|\,\|x\|}$. With $\kappa(A) = 10^{10}$ and a relative residual near machine epsilon ($\approx 10^{-16}$), the solution error can be $10^{-6}$, which is only 6 digits of accuracy despite a tiny residual.
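The Hilbert matrix is the classic demonstration of this gap. A sketch in NumPy:

```python
import numpy as np

n = 12
# Hilbert matrix: A[i, j] = 1 / (i + j + 1); condition number ~1e16 at n = 12.
A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])
x_true = np.ones(n)
b = A @ x_true

x_hat = np.linalg.solve(A, b)

rel_residual = np.linalg.norm(b - A @ x_hat) / np.linalg.norm(b)
rel_error = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
# rel_residual sits near machine epsilon; rel_error is many orders larger.
```

The solver is backward stable, so the residual is always tiny; the error in $\hat{x}$ is the residual amplified by $\kappa(A)$.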
float16 is not just float32 with less precision
float16 has a much smaller dynamic range (about $6 \times 10^{-5}$ to $6.5 \times 10^{4}$ for normal numbers) compared to float32 (about $1.2 \times 10^{-38}$ to $3.4 \times 10^{38}$). Underflow and overflow are not just precision issues; they produce zeros and infinities. bfloat16 has the same dynamic range as float32 (same exponent bits) but less precision: 8 vs 24 total significand bits (counting the implicit leading 1), or equivalently 7 vs 23 explicit stored mantissa bits. The choice between float16 and bfloat16 depends on whether precision or dynamic range is more important for the application.
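NumPy exposes the float16 and float32 limits directly (a sketch; NumPy has no native bfloat16, so its figures are stated only in comments):

```python
import numpy as np

f16 = np.finfo(np.float16)
f32 = np.finfo(np.float32)

assert f16.max == 65504.0       # float16 overflows just past 6.5e4
assert f16.tiny < 7e-5          # smallest normal float16 is 2**-14 ≈ 6.1e-5
assert f32.max > 3.4e38         # float32 reaches ~3.4e38
# bfloat16 (not in NumPy) shares float32's 8 exponent bits, so its range is
# also ~1e-38 to ~3e38, but with only 8 significand bits of precision.

# Overflow is abrupt: a single float16 multiply is enough.
with np.errstate(over='ignore'):
    assert np.isinf(np.float16(300.0) * np.float16(300.0))
```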
Numerical stability and algorithmic stability are different concepts
Numerical stability concerns rounding errors in finite-precision arithmetic. Algorithmic stability (in the ML sense) concerns how much the learned model changes when one training example is added or removed. Both use the word "stability" but they are different properties. A learning algorithm can be algorithmically stable (small model change per sample) while being numerically unstable (rounding errors corrupt the gradient), or vice versa.
Summary
- Forward error is bounded by condition number times backward error. This decomposition separates the problem's sensitivity from the algorithm's accuracy.
- Gaussian elimination with partial pivoting is backward stable; the forward error is controlled by the condition number.
- Catastrophic cancellation occurs when subtracting nearly equal numbers, destroying relative precision.
- The log-sum-exp trick, softmax centering, and Welford variance are standard defenses against numerical instability.
- Attention scaling by $1/\sqrt{d_k}$ prevents softmax saturation.
- Mixed-precision training uses float16 for speed and float32 for accumulation to avoid gradient underflow.
- Conditioning is a problem property; stability is an algorithm property. Know which one is causing your trouble.
Exercises
Problem
A $2 \times 2$ symmetric positive definite matrix $A$ has eigenvalues $\lambda_{\max}$ and $\lambda_{\min}$, recoverable from its trace and determinant. Compute its condition number $\kappa_2(A) = \lambda_{\max} / \lambda_{\min}$. If you solve $Ax = b$ with a backward stable algorithm in float64 ($u \approx 10^{-16}$), what is the approximate worst-case relative error in $\hat{x}$?
Problem
Show that computing $f(x) = \sqrt{x + 1} - \sqrt{x}$ for large $x$ suffers from catastrophic cancellation. Derive a mathematically equivalent form that is numerically stable.
Problem
In mixed-precision training, gradients are computed in float16 and accumulated in float32. Explain why loss scaling is necessary. If the typical gradient magnitude is $g$ and float16 has a minimum positive normal value of about $6.1 \times 10^{-5}$, what loss scale factor prevents underflow?
Problem
The two-pass variance algorithm computes $\bar{x}$ first, then $\frac{1}{n} \sum_i (x_i - \bar{x})^2$. The one-pass formula $\frac{1}{n} \sum_i x_i^2 - \bar{x}^2$ suffers from catastrophic cancellation. Construct a dataset of 4 numbers where the one-pass formula gives a negative variance in 4-digit decimal arithmetic with rounding after each operation, but the two-pass formula gives a positive result.
References
Canonical:
- Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002), Chapters 1-3 (backward error, condition numbers) and Chapter 9 (Gaussian elimination stability)
- Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-16 (conditioning, backward stability, QR)
- Wilkinson, Rounding Errors in Algebraic Processes (1963), Chapters 1-2 — the original backward error framework
ML Connections:
- Micikevicius et al., "Mixed Precision Training" (ICLR 2018), Section 3 (loss scaling, float16 underflow)
- Vaswani et al., "Attention Is All You Need" (2017), Section 3.2.1 (scaling by $1/\sqrt{d_k}$)
- Ioffe & Szegedy, "Batch Normalization" (ICML 2015), Section 3 (normalization as numerical stabilization)
- Santurkar et al., "How Does Batch Normalization Help Optimization?" (NeurIPS 2018), Section 3 (optimization landscape smoothing)
Accessible:
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991), Sections 1-3 (rounding, cancellation, IEEE 754)
Next Topics
- Conditioning and condition number: deeper treatment of condition numbers for specific problems in optimization and linear algebra
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Floating-Point Arithmetic (layer 0A, tier 1)
- Matrix Norms (layer 0A, tier 1)
- Matrix Operations and Properties (layer 0A, tier 1)
Derived topics
- Conditioning and Condition Number (layer 1, tier 1)