

Numerical Stability and Conditioning

Continuous math becomes real only through finite-precision approximation. Condition numbers, backward stability, catastrophic cancellation, and why theorems about reals do not transfer cleanly to floating-point.

Core · Tier 1 · Stable · Core spine · ~45 min

Why This Matters

Every computation in ML runs on finite-precision hardware. Gradients are computed in float32 or float16. Matrix multiplications accumulate rounding errors. Softmax overflows without careful implementation. Loss functions produce NaN when the logarithm receives a zero.

The theorems you learn in analysis assume exact real arithmetic. Numerical stability theory tells you what happens when those assumptions fail. A numerically unstable algorithm can produce garbage output even when the underlying mathematical problem is well-posed. A poorly conditioned problem amplifies input perturbations regardless of the algorithm.

Understanding the distinction between conditioning (a property of the problem) and stability (a property of the algorithm) is required for debugging training failures, choosing implementations, and understanding why tricks like log-sum-exp, BatchNorm, residual connections, and attention scaling exist.

theorem visual

Conditioning versus algorithmic stability

A bad answer can come from a sensitive problem, an unstable algorithm, or both. The fix depends on which layer is failing.

problem layer

Conditioning asks how much the exact map ff amplifies small input perturbations.

algorithm layer

Backward stability asks whether the computed y^\hat{y} is the exact answer f(x^)f(\hat{x}) for a nearby input x^\hat{x}.

failure mode

Catastrophic cancellation can make a naive formula return zero when an algebraically equivalent rewrite would return the correct small value.

Prerequisites

This page assumes familiarity with floating-point arithmetic (IEEE 754, machine epsilon, rounding) and matrix operations (norms, eigenvalues, singular values).

Core Definitions

Definition

Forward Error

The forward error of a computation is the difference between the computed result y^\hat{y} and the true result yy:

forward error=y^y\text{forward error} = \|\hat{y} - y\|

Relative forward error: y^y/y\|\hat{y} - y\| / \|y\|.

Definition

Backward Error

The backward error of a computation asks: for what perturbed input x^\hat{x} does the exact function ff produce the computed output y^\hat{y}? That is, find the smallest Δx\Delta x such that:

y^=f(x+Δx)\hat{y} = f(x + \Delta x)

The backward error is Δx/x\|\Delta x\| / \|x\|. A small backward error means the computed answer is the exact answer to a nearby problem.

Definition

Condition Number

The condition number of a problem measures how sensitive the output is to small perturbations in the input. For a differentiable function ff at input xx:

κ=supΔxϵf(x+Δx)f(x)/f(x)Δx/x\kappa = \sup_{\|\Delta x\| \leq \epsilon} \frac{\|f(x + \Delta x) - f(x)\| / \|f(x)\|}{\|\Delta x\| / \|x\|}

For a matrix AA, the condition number with respect to solving Ax=bAx = b is:

κ(A)=AA1=σmax(A)σmin(A)\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where σmax\sigma_{\max} and σmin\sigma_{\min} are the largest and smallest singular values. A large condition number means the problem is ill-conditioned: small input changes produce large output changes.
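For a 2×2 matrix this can be computed by hand: the singular values are the square roots of the eigenvalues of ATAA^T A, which follow from its trace and determinant. A minimal sketch in pure Python (the helper name is ours):

```python
import math

def cond_2x2(a, b, c, d):
    """kappa_2 = sigma_max / sigma_min for the matrix [[a, b], [c, d]].

    Singular values are square roots of the eigenvalues of A^T A; for a
    2x2 symmetric matrix those come from the mean and discriminant.
    """
    p = a * a + c * c      # (A^T A)_11
    q = a * b + c * d      # (A^T A)_12 = (A^T A)_21
    r = b * b + d * d      # (A^T A)_22
    mean = (p + r) / 2
    disc = math.sqrt(((p - r) / 2) ** 2 + q * q)
    return math.sqrt((mean + disc) / (mean - disc))

kappa = cond_2x2(1, 2, 3, 4)   # about 14.93 for this matrix
```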

Definition

Numerical Stability

"Numerical stability" is an umbrella term for several distinct, increasingly demanding conditions; modern usage (Higham, Accuracy and Stability of Numerical Algorithms, Ch. 1) distinguishes:

  • Forward stable: the computed y^\hat y is close to the true y=f(x)y = f(x), with relative forward error of order κ(x)ϵmach\kappa(x) \cdot \epsilon_{\text{mach}} — i.e. as accurate as the conditioning permits.
  • Mixed forward-backward stable: y^=f(x^)+Δy\hat y = f(\hat x) + \Delta y with both x^x/x\|\hat x - x\|/\|x\| and Δy/y\|\Delta y\|/\|y\| of order ϵmach\epsilon_{\text{mach}}.
  • Backward stable: y^=f(x^)\hat y = f(\hat x) exactly for some x^\hat x with x^x/x=O(ϵmach)\|\hat x - x\|/\|x\| = O(\epsilon_{\text{mach}}) — the computed result is the exact answer to a nearby problem.

y^=f(x+Δx),Δx/x=O(ϵmach)\hat{y} = f(x + \Delta x), \quad \|\Delta x\| / \|x\| = O(\epsilon_{\text{mach}})

Backward stability ⟹ mixed forward-backward stability ⟹ forward stability (relative to conditioning), but the converses fail in general. Backward stability is the gold standard because (a) it implies the others and (b) the actual error is then bounded purely by the condition number κ\kappa via forward errorκbackward error\text{forward error} \leq \kappa \cdot \text{backward error}. Loose use of "stable" without qualification typically means backward stable in numerical linear algebra and forward stable in optimization.

The Central Relationship

The forward error is bounded by the product of condition number and backward error:

y^yyκΔxx\frac{\|\hat{y} - y\|}{\|y\|} \leq \kappa \cdot \frac{\|\Delta x\|}{\|x\|}

This means:

  • Well-conditioned + backward stable = accurate result. Small backward error, small amplification.
  • Ill-conditioned + backward stable = large forward error, but the best you can do. The problem itself is sensitive; no algorithm can avoid the amplification.
  • Well-conditioned + unstable = avoidable inaccuracy. Switch to a better algorithm.
  • Ill-conditioned + unstable = disaster. Both the problem and the algorithm contribute to error.

Main Theorems

Theorem

Backward Stability of Gaussian Elimination with Partial Pivoting

Statement

Gaussian elimination with partial pivoting computes the solution x^\hat{x} of Ax=bAx = b such that:

(A+ΔA)x^=b,ΔAg(n)uA(A + \Delta A)\hat{x} = b, \quad \|\Delta A\| \leq g(n) \cdot u \cdot \|A\|

where uu is the unit roundoff and g(n)g(n) is a growth factor. For partial pivoting, g(n)2n1g(n) \leq 2^{n-1} in the worst case. This exponential bound is pessimistic for most practical matrices, but the theorem's honest guarantee is still growth-factor dependent: pivoting makes Gaussian elimination reliable in ordinary use, not magically immune to pathological examples.

The forward error satisfies:

x^xxκ(A)g(n)u\frac{\|\hat{x} - x\|}{\|x\|} \leq \kappa(A) \cdot g(n) \cdot u

Intuition

Gaussian elimination with partial pivoting is backward stable: the computed solution x^\hat{x} is the exact solution of a slightly perturbed system (A+ΔA)x^=b(A + \Delta A)\hat{x} = b. The perturbation is small (on the order of machine epsilon times the matrix norm). The forward error is this backward error amplified by the condition number κ(A)\kappa(A). If AA is well-conditioned, the result is accurate. If AA is ill-conditioned, the error can be large, but no direct method can do better.

Proof Sketch

Track the rounding errors through the elimination process. Each elementary row operation introduces a relative error of O(u)O(u). Partial pivoting ensures that the multipliers are bounded by 1, preventing individual steps from amplifying errors excessively. The accumulated error after O(n3)O(n^3) operations is bounded by g(n)uAg(n) u \|A\|, where the growth factor g(n)g(n) accounts for the worst-case accumulation through the pivot sequence.

The full proof appears in Higham, Accuracy and Stability of Numerical Algorithms, Chapter 9.

Why It Matters

This theorem explains why Gaussian elimination works well in practice despite theoretical worst-case growth factors of 2n12^{n-1}. It also explains why condition number matters: even with a perfect algorithm, an ill-conditioned matrix produces large forward error. In ML, condition numbers arise everywhere: the Hessian condition number determines gradient descent convergence speed, the condition number of the data matrix affects linear regression stability, and the condition number of attention logits affects softmax precision.

Failure Mode

Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. The growth factor can be exponential. Example: the matrix (ϵ111)\begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix} for small ϵ\epsilon produces enormous multipliers (1/ϵ1/\epsilon) without pivoting, leading to complete loss of precision.

Partial pivoting fails in rare pathological cases (Wilkinson matrices) where g(n)=2n1g(n) = 2^{n-1}. Complete pivoting reduces g(n)g(n) further but is more expensive.
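The 2×2 failure above can be stepped through directly in float64. A sketch (ϵ\epsilon chosen small enough that the matrix entries of size 1 are absorbed by rounding):

```python
# Solve [[eps, 1], [1, 1]] x = [1, 2] by elimination, with and without pivoting.
eps = 1e-17

# No pivoting: the tiny pivot eps gives a huge multiplier m = 1/eps.
m = 1.0 / eps
a22 = 1.0 - m * 1.0            # 1 - 1e17 rounds to -1e17: the "1" is lost
b2 = 2.0 - m * 1.0             # 2 - 1e17 rounds to -1e17: the "2" is lost
x2 = b2 / a22                  # exactly 1.0
x1_nopivot = (1.0 - x2) / eps  # 0.0, but the true x1 is about 1

# Partial pivoting: swap rows first, so the multiplier is eps (bounded by 1).
x2_pivot = (1.0 - eps * 2.0) / (1.0 - eps)  # ~1.0
x1_pivot = 2.0 - x2_pivot                   # ~1.0, full accuracy
```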

Proposition

Catastrophic Cancellation

Statement

Let aa and bb be floating-point numbers with aba \approx b. The relative error in computing c^=fl(ab)\hat{c} = \text{fl}(a - b) can be as large as:

c^(ab)ab2umax(a,b)ab\frac{|\hat{c} - (a - b)|}{|a - b|} \leq \frac{2u \cdot \max(|a|, |b|)}{|a - b|}

When aba|a - b| \ll |a|, this relative error is much larger than uu. In the extreme, when aa and bb agree in kk leading digits, the result aba - b has only pkp - k significant digits (where pp is the total precision).

Intuition

Subtraction of nearly equal numbers cancels the leading significant digits, leaving only the noisy trailing digits. The absolute error in c^\hat{c} is no worse than before (bounded by umax(a,b)u \cdot \max(|a|, |b|)), but the result ab|a - b| is tiny, so the relative error explodes.

Example: in 7-digit arithmetic, 1.2345671.234566=0.0000011.234567 - 1.234566 = 0.000001. But if the last digits were rounded, the result might be 0.0000020.000002 or 0.0000000.000000, giving 100% relative error.
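The same effect is easy to trigger in float64. A small illustration with (1cosx)/x2(1 - \cos x)/x^2, which should approach 1/21/2 as x0x \to 0:

```python
import math

x = 1e-8
# Naive form: cos(1e-8) rounds to exactly 1.0 in float64, so the
# subtraction cancels every significant digit and returns 0.
naive = (1 - math.cos(x)) / x**2

# Stable rewrite via the identity 1 - cos(x) = 2*sin(x/2)**2: no subtraction.
stable = 2 * (math.sin(x / 2) / x) ** 2

print(naive, stable)   # 0.0 vs ~0.5
```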

Proof Sketch

The absolute error in aba - b is bounded by u(a+b)2umax(a,b)u(|a| + |b|) \leq 2u \max(|a|, |b|) (from the rounding errors already present in aa and bb). Dividing by ab|a - b| gives the relative error bound. When ab|a - b| is small compared to a|a|, the ratio is large.

Why It Matters

Catastrophic cancellation, along with its close relatives overflow and underflow, explains many numerical failures in ML:

  • Computing Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2 fails when the variance is small relative to the mean, because both terms are large and nearly equal. The stable alternative: compute the variance using the two-pass algorithm or Welford's online algorithm.
  • Softmax: exp(xi)/jexp(xj)\exp(x_i) / \sum_j \exp(x_j) overflows when xix_i is large. The fix: subtract maxjxj\max_j x_j from all entries. This is numerically equivalent but avoids overflow.
  • LogSumExp: logjexp(xj)\log \sum_j \exp(x_j) requires the same trick to avoid overflow.

Backward vs Forward Error Analysis

The Wilkinson framework, developed by J.H. Wilkinson in the 1960s, provides a systematic way to analyze numerical algorithms. The key insight is to ask not "how wrong is the output?" but "for what exact problem is the output the correct answer?"

Forward error analysis tracks how rounding errors accumulate through each operation and bounds the total deviation of the computed result from the true result. It is direct but tends to be pessimistic: errors are assumed to compound in the worst direction at every step, which rarely happens in practice.

Backward error analysis takes the computed output y^\hat{y} and asks: what is the smallest perturbation δx\delta x to the input xx such that f(x+δx)=y^f(x + \delta x) = \hat{y} exactly? The backward error is δx/x\|\delta x\| / \|x\|. An algorithm is backward stable if and only if this ratio is O(ϵmach)O(\epsilon_{\text{mach}}) for all inputs.

The central theorem connecting the two:

forward errorκbackward error\text{forward error} \leq \kappa \cdot \text{backward error}

where κ\kappa is the condition number of the problem. This decomposes error into two orthogonal contributions:

  • Problem sensitivity (κ\kappa): intrinsic, cannot be improved by changing the algorithm.
  • Algorithm quality (backward error): can be improved by choosing a better algorithm.

A backward stable algorithm achieves backward error of O(ϵmach)O(\epsilon_{\text{mach}}). It "does as well as the problem allows"; the only remaining error is from conditioning.

Wilkinson's specific contributions include the backward-error framework for Gaussian elimination and the growth-factor analysis that explains why partial pivoting is usually reliable despite rare worst cases. The growth factor g(n)=maxkA(k)/Ag(n) = \max_{k} \|A^{(k)}\|_\infty / \|A\|_\infty (where A(k)A^{(k)} is the matrix after kk elimination steps) is the key quantity bounding backward error.

In floating-point arithmetic, every elementary operation fl(ab)=(ab)(1+δ)\text{fl}(a \circ b) = (a \circ b)(1 + \delta) with δϵmach|\delta| \leq \epsilon_{\text{mach}}. A backward error analysis chains these single-step bounds through the entire computation. For matrix-vector multiplication y=Axy = Ax with an n×nn \times n matrix, the computed y^\hat{y} can be written as

y^=(A+ΔA)x,ΔAijnϵmachAij.\hat{y}=(A+\Delta A)x,\qquad |\Delta A_{ij}| \lesssim n\epsilon_{\text{mach}}|A_{ij}|.

So the matrix-vector product is backward stable with respect to the matrix entries. The relative forward error is controlled by the condition number of the multiplication problem, roughly Ax/Ax\|A\|\,\|x\|/\|Ax\|, not by the linear-solve condition number AA1\|A\|\|A^{-1}\| unless you are actually solving Ax=bAx=b.

The practical implication for ML: when a training run produces NaN or wildly large gradients, the question is whether the problem is conditioning (the Hessian eigenvalue ratio is extreme) or stability (the algorithm is amplifying errors beyond what conditioning requires). Backward error analysis distinguishes the two.

Key Stability Tricks in ML

Log-Sum-Exp

The naive computation of logj=1nexp(xj)\log \sum_{j=1}^n \exp(x_j) overflows when any xj>709x_j > 709 (for float64) or xj>88x_j > 88 (for float32). The stable version:

LogSumExp(x)=m+logj=1nexp(xjm),m=maxjxj\text{LogSumExp}(x) = m + \log \sum_{j=1}^n \exp(x_j - m), \quad m = \max_j x_j

Subtracting mm ensures all exponents are 0\leq 0, so exp(xjm)(0,1]\exp(x_j - m) \in (0, 1]. This is mathematically identical but numerically safe.
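A minimal implementation of the shifted form in pure Python (the naive math.exp(1000.0) would raise OverflowError):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x_j))): factor the max out of the sum first."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

val = logsumexp([1000.0, 1000.0])   # exact answer is 1000 + log(2)
```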

Softmax Stability

Similarly, softmax(x)i=exp(xi)/jexp(xj)\text{softmax}(x)_i = \exp(x_i) / \sum_j \exp(x_j) is computed as:

softmax(x)i=exp(xim)/jexp(xjm)\text{softmax}(x)_i = \exp(x_i - m) / \sum_j \exp(x_j - m)

This prevents overflow in the numerator and denominator.
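The same max-shift in code, on logits where the naive version overflows immediately:

```python
import math

def softmax(xs):
    """Stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])   # naive exp(1000) overflows
```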

Variance Computation

The textbook formula Var(X)=E[X2](E[X])2\text{Var}(X) = E[X^2] - (E[X])^2 suffers from catastrophic cancellation when the mean is large and the variance is small. Welford's algorithm computes the variance in a single pass with O(1)O(1) memory and avoids this:

Update running mean xˉn\bar{x}_n and sum of squared differences MnM_n:

  • δ=xnxˉn1\delta = x_n - \bar{x}_{n-1}
  • xˉn=xˉn1+δ/n\bar{x}_n = \bar{x}_{n-1} + \delta / n
  • Mn=Mn1+δ(xnxˉn)M_n = M_{n-1} + \delta (x_n - \bar{x}_n)
  • Var=Mn/(n1)\text{Var} = M_n / (n - 1)
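The updates above in code, side by side with the cancelling textbook formula. With a mean of 10810^8 and a variance of order 10210^{-2}, the one-pass sum loses every digit of the variance in float64:

```python
def welford_variance(xs):
    """Single-pass, cancellation-free sample variance (Welford's updates)."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (len(xs) - 1)

data = [1e8 + d for d in (0.0, 0.1, 0.2, 0.3)]   # huge mean, tiny variance
n = len(data)
naive = (sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1)
stable = welford_variance(data)   # ~0.0167; naive is wildly wrong
```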

Attention Scaling

In the transformer, attention logits are QKT/dkQK^T / \sqrt{d_k} where dkd_k is the head dimension. For q,kRdkq, k \in \mathbb{R}^{d_k} with i.i.d. unit-variance entries, the dot product qkq \cdot k has variance O(dk)O(d_k) and therefore typical magnitude O(dk)O(\sqrt{d_k}). Without the dk\sqrt{d_k} scaling the logits would grow as dk\sqrt{d_k}, pushing the softmax into its saturated regime where gradients vanish. Dividing by dk\sqrt{d_k} restores unit variance and keeps the logits in a range where softmax has meaningful gradients.
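The variance claim is easy to check empirically. A Monte Carlo sketch in pure Python (fixed seed; the bounds are loose because the estimates are noisy):

```python
import random
import statistics

random.seed(0)
d_k = 128
dots = []
for _ in range(1000):
    q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
    dots.append(sum(qi * ki for qi, ki in zip(q, k)))

var_unscaled = statistics.pvariance(dots)                           # ~d_k
var_scaled = statistics.pvariance([d / d_k ** 0.5 for d in dots])   # ~1
```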

Mixed-Precision Training

Modern ML training uses float16 (or bfloat16) for the forward and backward passes and float32 for weight updates. The float32 master weights keep small updates from being lost during accumulation, while loss scaling addresses underflow: float16 gradients flush to zero below the smallest subnormal, 2246×1082^{-24} \approx 6 \times 10^{-8}. Loss scaling (multiplying the loss by a large constant, computing gradients, then dividing the gradients by the same constant) shifts the gradient distribution into the representable range.
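The underflow and the loss-scaling fix can be reproduced with Python's struct module, which supports IEEE 754 half precision via the 'e' format (the scale value here is illustrative):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                     # below the smallest fp16 subnormal (~6e-8)
assert to_fp16(grad) == 0.0     # flushes to zero: the gradient is lost

scale = 2 ** 12                 # illustrative loss scale
scaled = to_fp16(grad * scale)  # 4.096e-5 is representable in fp16
recovered = scaled / scale      # unscale in float64: ~1e-8 again
```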

Residual Connections and BatchNorm

Residual connections (y=x+f(x)y = x + f(x)) add a skip path that preserves signal magnitude. Without them, signals in deep networks can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients). BatchNorm normalizes intermediate activations to zero mean and unit variance. The original paper framed this as reducing internal covariate shift; later work argues the main benefit is often smoother optimization geometry and better gradient conditioning. Either way, the practical lesson is scale control: deep nets train better when activation and gradient magnitudes stay in a numerically useful range.

Stability Debugger for ML

Symptom | Likely numerical mechanism | Better response
NaNs in softmax or cross-entropy | exponent overflow or log(0)\log(0) | Use fused log_softmax / cross-entropy implementations
Negative or zero variance estimate | cancellation in E[X2](E[X])2E[X^2]-(E[X])^2 | Use two-pass variance or Welford updates
Tiny residual but bad solution | ill-conditioned linear system | Inspect κ(A)\kappa(A), regularize, precondition, or rescale
fp16 gradients become zero | underflow | Use loss scaling, bfloat16, or float32 accumulation
Loss spikes after a learning-rate change | unstable discretization / step size too large | Reduce step size, add clipping, check normalization

Common Confusions

Watch Out

Conditioning is a property of the problem, not the algorithm

A matrix with condition number 101510^{15} produces large forward errors under ANY algorithm, not just unstable ones. No amount of algorithmic cleverness can compensate for ill-conditioning. The only solution is to reformulate the problem (regularization, preconditioning) or accept the limited precision.

Watch Out

Small residual does not mean accurate solution

For the linear system Ax=bAx = b, the residual r=bAx^r = b - A\hat{x} can be small even when x^\hat{x} is far from the true xx, if AA is ill-conditioned. The bound is xx^/xκ(A)r/b\|x - \hat{x}\| / \|x\| \leq \kappa(A) \cdot \|r\| / \|b\|. With κ(A)=1010\kappa(A) = 10^{10} and r/b=1016\|r\|/\|b\| = 10^{-16} (machine epsilon), the solution error can be 10610^{-6}, which is only 6 digits of accuracy despite a tiny residual.
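A concrete 2×2 instance with hand-rolled arithmetic (the perturbation direction [1, -1] is approximately the near-null direction of this matrix):

```python
A = [[1.0, 1.0], [1.0, 1.001]]   # kappa_2 is roughly 4e3
x_true = [1.0, 1.0]
b = [2.0, 2.001]                 # so A @ x_true = b exactly

# Perturb the solution along the near-null direction [1, -1].
x_hat = [1.0 + 1e-3, 1.0 - 1e-3]

residual = [b[i] - (A[i][0] * x_hat[0] + A[i][1] * x_hat[1]) for i in range(2)]
# residual is ~[0, 1e-6]: tiny, yet x_hat is off by 1e-3 in each component.
err = max(abs(x_hat[0] - 1.0), abs(x_hat[1] - 1.0))
```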

Watch Out

float16 is not just float32 with less precision

float16 has a much smaller dynamic range (roughly 6×1086 \times 10^{-8}, the smallest subnormal, up to 6550465504) compared to float32 (roughly 103810^{-38} to 3×10383 \times 10^{38}). Underflow and overflow are not just precision issues; they produce zeros and infinities. bfloat16 has the same dynamic range as float32 (same exponent bits) but less precision: 8 vs 24 total significand bits (counting the implicit leading 1), or equivalently 7 vs 23 explicit stored mantissa bits. The choice between float16 and bfloat16 depends on whether precision or dynamic range is more important for the application.

Watch Out

Numerical stability and algorithmic stability are different concepts

Numerical stability concerns rounding errors in finite-precision arithmetic. Algorithmic stability (in the ML sense) concerns how much the learned model changes when one training example is added or removed. Both use the word "stability" but they are different properties. A learning algorithm can be algorithmically stable (small model change per sample) while being numerically unstable (rounding errors corrupt the gradient), or vice versa.

Summary

  • Forward error = condition number times backward error. This decomposition separates the problem's sensitivity from the algorithm's accuracy.
  • Gaussian elimination with partial pivoting is backward stable; the forward error is controlled by the condition number.
  • Catastrophic cancellation occurs when subtracting nearly equal numbers, destroying relative precision.
  • The log-sum-exp trick, softmax centering, and Welford variance are standard defenses against numerical instability.
  • Attention scaling by 1/dk1/\sqrt{d_k} prevents softmax saturation.
  • Mixed-precision training uses float16 for speed and float32 for accumulation to avoid gradient underflow.
  • Conditioning is a problem property; stability is an algorithm property. Know which one is causing your trouble.

Exercises

ExerciseCore

Problem

The matrix A=(1111.0001)A = \begin{pmatrix} 1 & 1 \\ 1 & 1.0001 \end{pmatrix} has trace 2.00012.0001 and determinant 0.00010.0001, giving eigenvalues approximately λ12.00005\lambda_1 \approx 2.00005 and λ20.00005\lambda_2 \approx 0.00005. Compute its condition number κ2(A)=σmax/σmin\kappa_2(A) = \sigma_{\max}/\sigma_{\min}. If you solve Ax=bAx = b with a backward stable algorithm in float64 (u1016u \approx 10^{-16}), what is the approximate worst-case relative error in xx?

ExerciseCore

Problem

Show that computing f(x)=x+1xf(x) = \sqrt{x+1} - \sqrt{x} for large xx suffers from catastrophic cancellation. Derive a mathematically equivalent form that is numerically stable.

ExerciseAdvanced

Problem

In mixed-precision training, gradients are computed in float16 and accumulated in float32. Explain why loss scaling is necessary. If the typical gradient magnitude is 10510^{-5} and float16 flushes values below its smallest positive subnormal 2246×1082^{-24} \approx 6 \times 10^{-8} to zero, what loss scale factor prevents underflow?

ExerciseAdvanced

Problem

The two-pass variance algorithm computes xˉ=1nxi\bar{x} = \frac{1}{n}\sum x_i first, then s2=1n1(xixˉ)2s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2. The one-pass formula s2=1n1(xi2nxˉ2)s^2 = \frac{1}{n-1}(\sum x_i^2 - n\bar{x}^2) suffers from catastrophic cancellation. Construct a dataset of 4 numbers where the one-pass formula gives a negative variance in 4-digit decimal arithmetic with rounding after each operation, but the two-pass formula gives a positive result.

References

Canonical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., 2002), Chapters 1-3 (backward error, condition numbers) and Chapter 9 (Gaussian elimination stability)
  • Trefethen & Bau, Numerical Linear Algebra (1997), Lectures 12-16 (conditioning, backward stability, QR)
  • Wilkinson, Rounding Errors in Algebraic Processes (1963), Chapters 1-2 — the original backward error framework

ML Connections:

  • Micikevicius et al., "Mixed Precision Training" (ICLR 2018), Section 3 (loss scaling, float16 underflow)
  • Vaswani et al., "Attention Is All You Need" (2017), Section 3.2.1 (scaling by dk\sqrt{d_k})
  • Ioffe & Szegedy, "Batch Normalization" (ICML 2015), Section 3 (normalization as numerical stabilization)
  • Santurkar et al., "How Does Batch Normalization Help Optimization?" (NeurIPS 2018), Section 3 (optimization landscape smoothing)

Accessible:

  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991), Sections 1-3 (rounding, cancellation, IEEE 754)


Last reviewed: April 26, 2026
