
Numerical Optimization

Floating-Point Arithmetic

How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Every number in your neural network (every weight, gradient, activation, and loss value) is a floating-point number with finite precision. When you train a model and the loss becomes NaN, when gradients explode or vanish, when two numbers that should be equal are not, floating-point arithmetic is the root cause.

Understanding floating-point is not optional for ML practitioners. It explains why we use log-sum-exp instead of summing exponentials, why bfloat16 works for training but float16 sometimes does not, and why numerical stability is a real engineering constraint.

Mental Model

A floating-point number is scientific notation for computers. Just as $6.022 \times 10^{23}$ has a significand (6.022) and an exponent (23), a floating-point number has a mantissa and an exponent, but in base 2. The mantissa gives you precision (how many significant digits), and the exponent gives you range (how large or small the number can be).

The fundamental limitation: you have a fixed number of bits for the mantissa, so most real numbers cannot be represented exactly. They get rounded to the nearest representable number.

Formal Setup and Notation

Definition

IEEE 754 Floating-Point Representation

A floating-point number in IEEE 754 format is stored as three fields:

$$(-1)^s \times 1.m \times 2^{e - \text{bias}}$$

where $s$ is the sign bit (0 for positive, 1 for negative), $m$ is the mantissa (also called significand or fraction), and $e$ is the exponent. The leading 1 in "$1.m$" is implicit (not stored), giving one extra bit of precision for free.

For float32: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 total). For float64: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 total).
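The bit layout can be inspected directly. A small sketch using Python's standard struct module to unpack the three fields of a float32:

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Unpack a float32 into its (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF         # 23 stored mantissa bits
    return sign, exponent, mantissa

# 1.0 = (-1)^0 x 1.0 x 2^(127-127): sign 0, biased exponent 127, mantissa 0
print(float32_fields(1.0))   # (0, 127, 0)
# -2.0 = (-1)^1 x 1.0 x 2^(128-127)
print(float32_fields(-2.0))  # (1, 128, 0)
```

The biased exponent stored in the bits is $e$; the actual exponent is $e - 127$ for float32.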

Definition

Unit Roundoff

Two related constants get called "machine epsilon" depending on which community is talking. Both matter; this page uses both names where they are standard.

Unit roundoff $u$ is the worst-case relative rounding error for a single correctly-rounded operation:

$$u = 2^{-p}$$

where $p$ is the number of mantissa bits including the implicit leading 1. This is the constant in the standard $(1 + \delta)$ floating-point error model below. Numerical-analysis textbooks (Higham) use $u$ this way.

Spacing at 1, sometimes called $\varepsilon_{\text{next}}$, is the distance from $1$ to the next representable float:

$$\varepsilon_{\text{next}} = \mathrm{nextafter}(1, +\infty) - 1 = 2^{1-p}.$$

This is what C's FLT_EPSILON and NumPy's numpy.finfo(...).eps return. It is exactly twice the unit roundoff: $\varepsilon_{\text{next}} = 2u$.

| Format | $p$ | $u = 2^{-p}$ | $\varepsilon_{\text{next}} = 2^{1-p}$ |
| --- | --- | --- | --- |
| float32 | 24 | $\approx 5.96 \times 10^{-8}$ | $\approx 1.19 \times 10^{-7}$ |
| float64 | 53 | $\approx 1.11 \times 10^{-16}$ | $\approx 2.22 \times 10^{-16}$ |

When you read about "machine epsilon," check whether the context is analytic (then it usually means $u$) or programming (then it usually means $\varepsilon_{\text{next}}$). The $(1 + \delta)$ error model below uses $u$, not $\varepsilon_{\text{next}}$.

Throughout the rest of this page, $\epsilon_{\text{mach}}$ denotes $u = 2^{-p}$ (the analytic convention).
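Both constants are easy to check numerically. A quick sketch with NumPy, whose finfo(...).eps reports the spacing at 1, not the unit roundoff:

```python
import numpy as np

eps32 = np.finfo(np.float32).eps      # spacing at 1: 2**-23 ~ 1.19e-7
eps64 = np.finfo(np.float64).eps      # 2**-52 ~ 2.22e-16

# nextafter(1, up) - 1 recovers the same spacing directly
step = np.nextafter(np.float32(1.0), np.float32(2.0)) - np.float32(1.0)
print(step == eps32)                  # True

u32 = eps32 / 2                       # unit roundoff u = 2**-24 ~ 5.96e-8
print(eps32, u32)
```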

Definition

ULP (Unit in the Last Place)

The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable floating-point number. For $x = 1.0$ in float32, the ULP is $2^{-23} \approx 1.19 \times 10^{-7}$, exactly $\varepsilon_{\text{next}}$ above. ULP grows with the magnitude of $x$: large numbers have larger gaps between representable values.

Core Definitions

Overflow occurs when the result of a computation exceeds the largest representable number. In float32, the maximum is approximately $3.4 \times 10^{38}$. Overflow produces infinity ($+\infty$ or $-\infty$). Computing $e^{100}$ in float32 overflows.

Underflow occurs when the result is closer to zero than the smallest representable normal number. In float32, the smallest normal number is approximately $1.18 \times 10^{-38}$ and the smallest positive subnormal is about $1.4 \times 10^{-45}$. A result smaller than the smallest normal but larger than the smallest subnormal becomes a subnormal (denormalized) float, with reduced precision; only a result smaller than the smallest subnormal underflows to zero. Computing $e^{-100} \approx 3.7 \times 10^{-44}$ in float32 lands in the subnormal range (not zero); computing $e^{-150} \approx 7 \times 10^{-66}$ underflows to zero. Some accelerated ML kernels and CPU flags (FTZ, DAZ) flush subnormals to zero for speed, so the same expression can land at zero on one device and on a subnormal on another.
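All three regimes can be reproduced in NumPy by evaluating exp in float32 (on a standard CPU build without FTZ, the subnormal survives):

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    big = np.exp(np.float32(100.0))    # e^100 ~ 2.7e43 exceeds float32 max ~3.4e38
    tiny = np.exp(np.float32(-100.0))  # ~3.7e-44: lands in the subnormal range
    gone = np.exp(np.float32(-150.0))  # ~7e-66: below the smallest subnormal

print(big)               # inf
print(tiny, tiny > 0)    # a subnormal float, still positive
print(gone)              # 0.0
```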

Catastrophic cancellation occurs when subtracting two nearly equal numbers. If $a = 1.000001$ and $b = 1.000000$ in float32, then $a - b = 0.000001$. But if $a$ and $b$ each have 7 digits of precision, $a - b$ has only 1 digit of precision. The relative error explodes.

This is why computing $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is numerically unstable: if the mean is large relative to the standard deviation, you subtract two large, nearly equal numbers.

Why ML Uses Reduced Precision

| Format | Sign | Exponent | Mantissa | Total bits | Precision | Range |
| --- | --- | --- | --- | --- | --- | --- |
| float32 | 1 | 8 | 23 | 32 | ~7 digits | $\sim 10^{38}$ |
| float16 | 1 | 5 | 10 | 16 | ~3 digits | $\sim 6.5 \times 10^4$ |
| bfloat16 | 1 | 8 | 7 | 16 | ~2 digits | $\sim 10^{38}$ |
| int8 | - | - | - | 8 | 256 values | $[-128, 127]$ |

float16 has limited range (max ~65504) but decent precision. Gradients and activations in deep networks can exceed 65504, causing overflow.

bfloat16 sacrifices precision for range. It has the same exponent width (and hence dynamic range) as float32, with only 7 mantissa bits, so overflow is rare and most kernels skip the loss-scaling machinery that float16 needs. On supported hardware (TPUs, recent GPUs) bfloat16 is often the default for training. float16 is still the right path on some accelerators, with proper loss scaling and tensor-core kernels.

int8 quantization maps continuous values to 256 discrete levels. Used mainly for inference to reduce memory and compute by 4x vs float32; the canonical references are Jacob et al. (2018) for integer-only inference and Dettmers et al. (2022) LLM.int8() for transformer-specific techniques. Int8 training is an active area but not the standard path.
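As an illustration of the idea only (a minimal symmetric per-tensor scheme, not the exact method of either paper):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * np.float32(scale)

x = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print(q)                             # int8 codes
print(np.abs(x - x_hat).max())       # worst-case round-trip error ~ scale/2
```

Production schemes add per-channel scales, zero points for asymmetric ranges, and calibration; this sketch only shows the core map from continuous values to 256 levels.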

Main Theorems

Proposition

Fundamental Axiom of Floating-Point Arithmetic

Statement

For any real number $x$ in the representable range, the floating-point representation $\mathrm{fl}(x)$ satisfies:

$$\mathrm{fl}(x) = x(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$$

For any arithmetic operation $\circ \in \{+, -, \times, /\}$:

$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$$

Each floating-point operation introduces a relative error of at most $\epsilon_{\text{mach}}$.

Intuition

Every floating-point operation is "almost right": the result is within a factor of $(1 \pm u)$ of the true answer, with $u = \epsilon_{\text{mach}} = 2^{-p}$ the unit roundoff. The problem is that errors accumulate over many operations. The standard worst-case bound for $n$ chained operations (Higham) uses

$$\gamma_n = \frac{nu}{1 - nu},$$

valid when $nu < 1$; for small $nu$ this is approximately $nu$. For random, unbiased rounding errors that behave like independent zero-mean perturbations, a heuristic average-case scale closer to $\sqrt{n}\,u$ often holds, but this is a model assumption, not a theorem; pathological inputs and structured cancellation can defeat the heuristic.
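A small experiment makes the gap between the worst-case bound and typical behavior concrete: sum many copies of 0.1 in float32 and compare the observed relative error with the $nu$ scale.

```python
import numpy as np

# Sum n copies of 0.1 in float32, rounding after every addition.
n = 100_000
term = np.float32(0.1)

acc = np.float32(0.0)
for _ in range(n):
    acc = acc + term                 # each addition rounds to float32

exact = 0.1 * n                      # float64 reference, essentially exact here
rel_err = abs(float(acc) - exact) / exact

u = 2.0 ** -24                       # float32 unit roundoff
print(rel_err)                       # observed relative error
print(n * u)                         # worst-case scale n*u ~ 6e-3
```

The observed error is bounded by roughly $\gamma_{n-1}$ and is typically well below it, consistent with the average-case heuristic.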

Proof Sketch

The nearest floating-point number to $x$ differs from $x$ by at most half a ULP. The ULP of $x$ is at most $|x| \cdot 2^{1-p}$, where $p$ is the mantissa precision including the implicit bit. So $|\mathrm{fl}(x) - x| \leq \tfrac{1}{2}|x| \cdot 2^{1-p} = |x| \cdot 2^{-p} = |x|\,\epsilon_{\text{mach}}$. Dividing by $|x|$ gives the relative error bound.

Why It Matters

This axiom is the foundation of numerical analysis. Every error analysis in scientific computing starts from this bound. It tells you that single operations are accurate, and the challenge is controlling error accumulation over long computations (like training a neural network for millions of steps).

Failure Mode

The bound assumes no overflow or underflow. In the overflow regime (result exceeds max representable), you get infinity. In the underflow regime (result is too small), you lose relative accuracy because denormalized numbers have fewer significant bits.

Worked Examples of Precision Loss

Example

Catastrophic cancellation in variance computation

Compute the variance of $x = [10000.0, 10001.0, 10002.0]$ in float32.

Naive formula: $\mathrm{Var}(x) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

$\mathbb{E}[X] = 10001.0$ (exact in float32). $\mathbb{E}[X^2] = (10000^2 + 10001^2 + 10002^2)/3 = (10^8 + 100020001 + 100040004)/3 \approx 100020001.67$.

In float32, $10000^2 = 10^8$ is exact, but $10001^2 = 100020001$ has 9 significant digits while float32 provides only 7. The stored value is approximately $1.00020000 \times 10^8$, losing the last digit. Then $\mathbb{E}[X^2] - (\mathbb{E}[X])^2 \approx 100020001.67 - 100020001.0$. Both numbers agree in their first 7 significant digits, so the subtraction produces a result with approximately 0 reliable digits. The computed variance might be 0.0, 1.0, or any value in between, depending on rounding.

The stable one-pass algorithm (Welford's method) accumulates $M_2 = \sum_i (x_i - \bar{x})^2$ incrementally, avoiding the subtraction of two large numbers. The differences $x_i - \bar{x}$ are small, so no cancellation occurs. Divide $M_2$ by $n$ (population variance) or $n - 1$ (sample variance) at the end; for $x = [10000, 10001, 10002]$ these are $2/3$ and $1$ respectively. The naive formula above targets the population form $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

Example

Gradient accumulation in mixed precision training

In mixed-precision training, gradients are computed in float16 or bfloat16, then accumulated in float32. Consider adding a gradient of magnitude $10^{-4}$ to a running sum that has reached $10^4$. In float16, the smallest representable change to $10^4$ is about $10^4 \times 2^{-10} \approx 9.8$. Since $10^{-4} \ll 9.8$, the addition $10^4 + 10^{-4}$ rounds back to $10^4$. The small gradient is lost entirely.

In float32, the smallest representable change to $10^4$ is about $10^4 \times 2^{-23} \approx 1.2 \times 10^{-3}$, which is still larger than $10^{-4}$, so even float32 loses this gradient. The fix: use Kahan summation or accumulate in float64 for critical quantities like loss values.
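The swallowed update, and the Kahan fix, in a few lines of NumPy:

```python
import numpy as np

big = np.float16(1e4)
small = np.float16(1e-4)
print(big + small == big)            # True: the update is swallowed in float16

def kahan_sum(values):
    """Compensated (Kahan) summation in float32."""
    s = np.float32(0.0)
    c = np.float32(0.0)              # running compensation for lost low-order bits
    for v in values:
        y = np.float32(v) - c
        t = s + y
        c = (t - s) - y              # what got rounded away in s + y
        s = t
    return s

values = [np.float32(1e4)] + [np.float32(1e-4)] * 10_000

naive = np.float32(0.0)
for v in values:
    naive = naive + v                # plain float32 accumulation loses every 1e-4

compensated = kahan_sum(values)
print(float(naive), float(compensated))   # 10000.0 vs ~10001.0
```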

This is why loss scaling is used in mixed-precision training: multiply the loss by a large constant (e.g., $2^{16}$) before backpropagation, which scales all gradients by the same factor, then divide by the scale factor when updating weights in float32. This shifts small gradients into the representable range of float16.
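A toy NumPy-only sketch of static loss scaling (the gradient value $10^{-9}$ is illustrative; real frameworks scale the loss before backprop and often adjust the scale dynamically):

```python
import numpy as np

scale = np.float32(2.0 ** 16)

true_grad = 1e-9                        # below float16's smallest subnormal (~6e-8)
half_grad = np.float16(true_grad)
print(half_grad)                        # 0.0: the gradient underflows in float16

scaled = np.float16(true_grad * scale)  # 1e-9 * 65536 ~ 6.6e-5: representable
recovered = np.float32(scaled) / scale  # unscale in float32 before the weight update
print(recovered)                        # ~1e-9 again
```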

Canonical Examples

Example

Why 0.1 + 0.2 does not equal 0.3

The decimal number 0.1 has no exact binary representation (it is a repeating fraction in base 2, like 1/3 in base 10). In float64, $\mathrm{fl}(0.1) \approx 0.1000000000000000055511\ldots$, and similarly for 0.2. Their sum is $0.3000000000000000444\ldots$, which rounds to a different float64 than $\mathrm{fl}(0.3) = 0.2999999999999999888\ldots$
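This is checkable at any Python prompt; Decimal displays the exact values of the binary floats involved:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)        # False
print(Decimal(0.1 + 0.2))      # exact value of the computed float64 sum
print(Decimal(0.3))            # exact value of fl(0.3): a different float64
print((0.1 + 0.2) - 0.3)       # 2**-54 ~ 5.55e-17, one ULP at 0.3
```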

Example

Log-sum-exp trick

Computing $\log(\sum_i e^{x_i})$ directly overflows if any $x_i$ is large. The log-sum-exp trick: let $c = \max_i x_i$, then $\log(\sum_i e^{x_i}) = c + \log(\sum_i e^{x_i - c})$. Now the largest exponent is $e^0 = 1$, so no overflow. This is how softmax is computed in every deep learning framework.
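A direct NumPy implementation of the trick, plus the max-shifted softmax it enables (a sketch of the technique, not any framework's actual kernel):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so the largest exponent is 0."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # exp(1000) overflows even in float64
print(naive)                           # inf
print(logsumexp(x))                    # ~1002.41

def softmax(x):
    z = np.exp(x - np.max(x))          # same shift, applied elementwise
    return z / z.sum()

print(softmax(x))                      # well-defined despite huge logits
```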

Common Confusions

Watch Out

Floating-point numbers are not evenly spaced

Floating-point numbers are not uniformly spaced. Near zero they are dense; near the maximum they are sparse. The gap between $1.0$ and the next float32 is $\sim 10^{-7}$; the gap between $10^{30}$ and the next float32 is $\sim 10^{23}$. Relative error is constant, but absolute error grows with magnitude.
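NumPy's np.spacing reports the ULP at a given value, which makes the growing gaps visible:

```python
import numpy as np

print(np.spacing(np.float32(1.0)))    # 2**-23 ~ 1.19e-7
print(np.spacing(np.float32(1e30)))   # 2**76 ~ 7.6e22: a huge absolute gap

# Relative spacing stays roughly constant across magnitudes:
print(np.spacing(np.float32(1e30)) / np.float32(1e30))  # ~7.6e-8, still near u
```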

Watch Out

Double precision does not fix all problems

Switching from float32 to float64 gives more precision but does not fix unstable algorithms. If an algorithm amplifies errors by $10^{10}$ (ill-conditioned), float64 just delays the problem. Fix the algorithm, not the precision.

Summary

  • IEEE 754: $(-1)^s \times 1.m \times 2^{e - \text{bias}}$
  • Machine epsilon: worst-case relative error per operation; $\sim 10^{-7}$ for float32, $\sim 10^{-16}$ for float64
  • Overflow: number too large, becomes infinity
  • Underflow: number too small, becomes subnormal (reduced precision) or zero
  • Catastrophic cancellation: subtracting nearly equal numbers destroys precision
  • bfloat16 has float32's range but lower precision; preferred for training
  • Always use log-space arithmetic for products of probabilities

Exercises

ExerciseCore

Problem

In float32 ($\epsilon_{\text{mach}} \approx 6 \times 10^{-8}$), you compute $a = 1.0000001$ and $b = 1.0000000$. How many significant digits does $a - b = 0.0000001$ have?

ExerciseAdvanced

Problem

Explain why bfloat16 (8 exponent bits, 7 mantissa bits) is preferred over float16 (5 exponent bits, 10 mantissa bits) for training neural networks, despite having less precision.

References

Canonical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., SIAM 2002). The standard reference for the $(1 + \delta)$ error model, $\gamma_n = nu/(1 - nu)$, and stability analysis of basic algorithms.
  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991).
  • IEEE Standard 754-2019 for Floating-Point Arithmetic.

Current:

  • Micikevicius et al., "Mixed Precision Training" (ICLR 2018). FP16 + loss scaling.
  • Kalamkar et al., "A Study of BFLOAT16 for Deep Learning Training" (2019). bfloat16 exponent/range tradeoff.
  • Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (CVPR 2018). The canonical int8 inference reference.
  • Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (NeurIPS 2022). Transformer-specific int8 inference.
  • Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (ICLR 2022). 8-bit optimizer state for training memory savings — different problem from int8 inference.

Last reviewed: April 13, 2026
