Numerical Optimization
Floating-Point Arithmetic
How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.
Why This Matters
Every number in your neural network (every weight, gradient, activation, and loss value) is a floating-point number with finite precision. When you train a model and the loss becomes NaN, when gradients explode or vanish, when two numbers that should be equal are not, floating-point arithmetic is the root cause.
Understanding floating-point is not optional for ML practitioners. It explains why we use log-sum-exp instead of summing exponentials, why bfloat16 works for training but float16 sometimes does not, and why numerical stability is a real engineering constraint.
Mental Model
A floating-point number is scientific notation for computers. Just as $6.022 \times 10^{23}$ has a significand (6.022) and an exponent (23), a floating-point number has a mantissa and an exponent, but in base 2. The mantissa gives you precision (how many significant digits), and the exponent gives you range (how large or small the number can be).
The fundamental limitation: you have a fixed number of bits for the mantissa, so most real numbers cannot be represented exactly. They get rounded to the nearest representable number.
Formal Setup and Notation
IEEE 754 Floating-Point Representation
A floating-point number in IEEE 754 format is stored as three fields:

$$x = (-1)^s \times 1.m \times 2^{e - \text{bias}}$$

where $s$ is the sign bit (0 for positive, 1 for negative), $m$ is the mantissa (also called significand or fraction), and $e$ is the exponent. The leading 1 in "$1.m$" is implicit (not stored), giving one extra bit of precision for free.
For float32: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 total). For float64: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 total).
Unit Roundoff
Two related constants get called "machine epsilon" depending on which community is talking. Both matter; this page uses both names where they are standard.
Unit roundoff $u$ is the worst-case relative rounding error for a single correctly-rounded operation:

$$u = 2^{-p}$$

where $p$ is the number of mantissa bits including the implicit leading 1.
- This is the constant in the standard floating-point error model below. Numerical-analysis textbooks (Higham) use the term "machine epsilon" this way.
Spacing at 1, sometimes called $\epsilon_{\text{mach}}$, is the distance from $1$ to the next representable float:

$$\epsilon_{\text{mach}} = 2^{1-p}$$
This is what C's FLT_EPSILON / Python numpy.finfo(...).eps returns.
It is exactly twice the unit roundoff: $\epsilon_{\text{mach}} = 2u$.
| Format | $p$ | Unit roundoff $u = 2^{-p}$ | Spacing $\epsilon_{\text{mach}} = 2^{1-p}$ |
|---|---|---|---|
| float32 | 24 | $2^{-24} \approx 5.96 \times 10^{-8}$ | $2^{-23} \approx 1.19 \times 10^{-7}$ |
| float64 | 53 | $2^{-53} \approx 1.11 \times 10^{-16}$ | $2^{-52} \approx 2.22 \times 10^{-16}$ |
When you read about "machine epsilon," check whether the context is analytic (then it usually means $u$) or programming (then it usually means $\epsilon_{\text{mach}}$). The error model below uses $u$, not $\epsilon_{\text{mach}}$.
In the rest of this page, $\epsilon$ denotes the unit roundoff $u = 2^{-p}$ (the analytic convention).
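Both constants can be checked directly in Python, whose floats are IEEE 754 float64 ($p = 53$). A minimal sketch:

```python
import sys

eps = sys.float_info.epsilon  # spacing at 1 for float64: 2**-52
u = eps / 2                   # unit roundoff: 2**-53

assert eps == 2.0 ** -52

# Adding exactly the unit roundoff to 1 produces a tie, and ties
# round to even, so the sum rounds back down to 1:
assert 1.0 + u == 1.0

# Adding the full spacing moves to the next representable float:
assert 1.0 + eps > 1.0
```

This is why `1.0 + u == 1.0` is the usual quick test for "what is the smallest increment this format can see at 1".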
ULP (Unit in the Last Place)
The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable floating-point number. For $x = 1$ in float32, the ULP is $2^{-23} \approx 1.19 \times 10^{-7}$, exactly the $\epsilon_{\text{mach}}$ above. ULP grows with the magnitude of $x$: large numbers have larger gaps between representable values.
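Python's `math.ulp` (3.9+) reports this spacing for float64; the gaps are smaller than the float32 values quoted on this page, but the pattern is identical:

```python
import math

# The absolute gap to the next representable float grows with magnitude:
assert math.ulp(1.0) == 2.0 ** -52
assert math.ulp(1e8) == 2.0 ** -26   # 1e8 lies in [2**26, 2**27)
assert math.ulp(1e8) > math.ulp(1.0)

# But the *relative* spacing stays roughly constant:
for x in (1.0, 1e8, 1e300):
    assert math.ulp(x) / x <= 2.0 ** -52
```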
Core Definitions
Overflow occurs when the result of a computation exceeds the largest representable number. In float32, the maximum is approximately $3.4 \times 10^{38}$. Overflow produces infinity ($+\infty$ or $-\infty$). Computing $e^{100} \approx 2.7 \times 10^{43}$ in float32 overflows.
Underflow occurs when the result is closer to zero than the smallest representable normal number. In float32, the smallest normal number is approximately $1.18 \times 10^{-38}$ and the smallest positive subnormal is about $1.4 \times 10^{-45}$. A result smaller than the smallest normal but larger than the smallest subnormal becomes a subnormal (denormalized) float, with reduced precision; only a result smaller than the smallest subnormal underflows to zero. Computing $10^{-40}$ in float32 lands in the subnormal range (not zero); computing $10^{-46}$ underflows to zero. Some accelerated ML kernels and CPU flags (FTZ, DAZ) flush subnormals to zero for speed, so the same expression can land at zero on one device and on a subnormal on another.
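The three regimes are easy to observe in plain Python. Python floats are float64, so the thresholds differ from the float32 numbers above (max ~$1.8 \times 10^{308}$, smallest normal ~$2.2 \times 10^{-308}$, smallest subnormal ~$4.9 \times 10^{-324}$), but the behavior is the same:

```python
import math
import sys

# Overflow: exceeding the largest finite float64 produces infinity.
assert math.isinf(sys.float_info.max * 2)

# Gradual underflow: below the smallest normal number the result is a
# subnormal, which is nonzero but carries reduced precision.
sub = 1e-310
assert 0.0 < sub < sys.float_info.min

# Full underflow: below the smallest subnormal, the result is zero.
assert sub * 1e-20 == 0.0
```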
Catastrophic cancellation occurs when subtracting two nearly equal numbers. If $a = 1.0000001$ and $b = 1.0000000$ in float32, then $a - b \approx 10^{-7}$. But if $a$ and $b$ each have 7 digits of precision, $a - b$ has only about 1 digit of precision. The relative error explodes.
This is why computing $\mathrm{Var}(X) = E[X^2] - (E[X])^2$ is numerically unstable: if the mean is large relative to the standard deviation, you subtract two large, nearly equal numbers.
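A quick sketch of the cancellation effect, simulating float32 with Python's `struct` round-trip (Python floats themselves are float64):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

a = f32(1.0000001)   # stored as ~1.00000011920929 (nearest float32)
b = f32(1.0)
diff = f32(a - b)    # exactly 2**-23 ~ 1.19e-7, not the true 1e-7

# The subtraction itself is exact; the rounding already baked into a
# dominates, leaving ~19% relative error in the difference.
assert diff == 2.0 ** -23
assert abs(diff - 1e-7) / 1e-7 > 0.1
```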
Why ML Uses Reduced Precision
| Format | Sign | Exponent | Mantissa | Total bits | Precision | Range |
|---|---|---|---|---|---|---|
| float32 | 1 | 8 | 23 | 32 | ~7 digits | ~$\pm 3.4 \times 10^{38}$ |
| float16 | 1 | 5 | 10 | 16 | ~3 digits | ~$\pm 6.5 \times 10^{4}$ |
| bfloat16 | 1 | 8 | 7 | 16 | ~2 digits | ~$\pm 3.4 \times 10^{38}$ |
| int8 | - | - | - | 8 | 256 values | $[-128, 127]$ |
float16 has limited range (max ~65504) but decent precision. Gradients and activations in deep networks can exceed 65504, causing overflow.
bfloat16 sacrifices precision for range. It has the same exponent width (and hence dynamic range) as float32, with only 7 mantissa bits, so overflow is rare and most kernels skip the loss-scaling machinery that float16 needs. On supported hardware (TPUs, recent GPUs) bfloat16 is often the default for training. float16 is still the right path on some accelerators, with proper loss scaling and tensor-core kernels.
int8 quantization maps continuous values to 256 discrete levels. Used
mainly for inference to reduce memory and compute by 4x vs float32; the
canonical references are Jacob et al. (2018) for integer-only inference
and Dettmers et al. (2022) LLM.int8() for transformer-specific
techniques. Int8 training is an active area but not the standard path.
Main Theorems
Fundamental Axiom of Floating-Point Arithmetic
Statement
For any real number $x$ in the representable range, the floating-point representation satisfies:

$$\mathrm{fl}(x) = x(1 + \delta), \qquad |\delta| \le \epsilon$$

For any arithmetic operation $\circ \in \{+, -, \times, \div\}$:

$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \qquad |\delta| \le \epsilon$$

Each floating-point operation introduces a relative error of at most $\epsilon$.
Intuition
Every floating-point operation is "almost right": the result is within a factor of $1 + \epsilon$ of the true answer, with $\epsilon$ the unit roundoff. The problem is that errors accumulate over many operations. The standard worst-case bound for $n$ chained operations (Higham) uses

$$\gamma_n = \frac{n\epsilon}{1 - n\epsilon},$$

valid when $n\epsilon < 1$; for small $n\epsilon$ this is approximately $n\epsilon$. For random, unbiased rounding errors that behave like independent zero-mean perturbations, a heuristic average-case scale closer to $\sqrt{n}\,\epsilon$ often holds, but this is a model assumption, not a theorem; pathological inputs and structured cancellation can defeat the heuristic.
Proof Sketch
The nearest floating-point number to $x$ differs from $x$ by at most half a ULP. The ULP of $x$ is at most $2^{1-p}|x|$, where $p$ is the mantissa precision. So $|\mathrm{fl}(x) - x| \le 2^{-p}|x|$. Dividing by $|x|$ gives the relative error bound $\epsilon = 2^{-p}$.
Why It Matters
This axiom is the foundation of numerical analysis. Every error analysis in scientific computing starts from this bound. It tells you that single operations are accurate, and the challenge is controlling error accumulation over long computations (like training a neural network for millions of steps).
Failure Mode
The bound assumes no overflow or underflow. In the overflow regime (result exceeds max representable), you get infinity. In the underflow regime (result is too small), you lose relative accuracy because denormalized numbers have fewer significant bits.
Worked Examples of Precision Loss
Catastrophic cancellation in variance computation
Compute the variance of $x = (10000, 10001, 10002)$ in float32.
Naive formula: $\mathrm{Var}(x) = \frac{1}{n}\sum_i x_i^2 - \bar{x}^2$.
$\bar{x} = 10001$ (exact in float32). $\bar{x}^2 = 100020001$.
In float32, $\bar{x}$ is exact, but $\bar{x}^2 = 100020001$ has 9 significant digits while float32 provides only 7. The stored value is approximately $1.000200 \times 10^8$, losing the last digits. Then $\frac{1}{n}\sum_i x_i^2 \approx 1.000200 \times 10^8$ as well. Both numbers agree in their first 7 significant digits, so the subtraction produces a result with approximately 0 reliable digits. The true variance is $2/3$; the computed variance might be 0.0, 1.0, or any value in between, depending on rounding.
The stable one-pass algorithm (Welford's method) accumulates the running mean and the sum of squared deviations $M_2 = \sum_i (x_i - \bar{x})^2$ incrementally, avoiding the subtraction of two large numbers. The differences $x_i - \bar{x}$ are small, so no cancellation occurs. Divide $M_2$ by $n$ (population variance) or $n - 1$ (sample variance) at the end; for the data above these are $2/3$ and $1$ respectively. The naive formula above targets the population form $E[X^2] - (E[X])^2$.
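The worked example can be run end to end, simulating float32 by rounding every intermediate result with a `struct` round-trip. A sketch; real float32 kernels behave the same way on this input:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

xs = [10000.0, 10001.0, 10002.0]
n = len(xs)

# Naive one-pass formula, rounding every intermediate to float32:
s = ss = 0.0
for x in xs:
    s = f32(s + x)
    ss = f32(ss + f32(x * x))
mean = f32(s / n)
naive_var = f32(f32(ss / n) - f32(mean * mean))  # catastrophic cancellation

# Welford's stable one-pass algorithm, also rounded to float32:
mean_w = m2 = 0.0
for i, x in enumerate(xs, start=1):
    delta = f32(x - mean_w)
    mean_w = f32(mean_w + f32(delta / i))
    m2 = f32(m2 + f32(delta * f32(x - mean_w)))
welford_var = f32(m2 / n)  # population variance

print(naive_var, welford_var)  # 0.0 vs ~0.6666667 (true value is 2/3)
```

On this data the naive formula loses every significant digit and returns exactly 0.0, while Welford's method is accurate to float32 precision.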
Gradient accumulation in mixed precision training
In mixed-precision training, gradients are computed in float16 or bfloat16, then accumulated in float32. Consider adding a gradient of magnitude $10^{-5}$ to a running sum that has reached $10^3$. In float16, the smallest representable change to $10^3$ is about $0.5$. Since $10^{-5} \ll 0.5$, the addition rounds back to $10^3$. The small gradient is lost entirely.
In float32, the smallest representable change to $10^3$ is about $6 \times 10^{-5}$, which is still larger than $10^{-5}$, so even float32 loses this gradient. The fix: use Kahan summation or accumulate in float64 for critical quantities like loss values.
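A sketch of the lost-gradient effect and the Kahan fix, simulating float32 with a `struct` round-trip (the sum of $10^3$ and gradient of $10^{-5}$ are illustrative values):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

grad, steps = 1e-5, 100_000   # 100k tiny gradients, total mass 1.0

# Naive float32 accumulation: every single add rounds back to 1000,
# so all 100k gradients vanish without a trace.
naive = f32(1000.0)
for _ in range(steps):
    naive = f32(naive + grad)
assert naive == 1000.0

# Kahan (compensated) summation carries the lost low-order bits in c,
# releasing them into the sum once they grow large enough to register.
total, c = f32(1000.0), 0.0
for _ in range(steps):
    y = f32(grad - c)
    t = f32(total + y)
    c = f32(f32(t - total) - y)
    total = t
assert abs(total - 1001.0) < 0.01   # ~1001: the gradients survive
```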
This is why loss scaling is used in mixed-precision training: multiply the loss by a large constant (e.g., $2^{14}$) before backpropagation, which scales all gradients by the same factor, then divide by the scale factor when updating weights in float32. This shifts small gradients into the representable range of float16.
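The mechanics can be sketched with Python's `struct`, which supports IEEE half precision via the `"e"` format; the gradient magnitude and scale factor here are illustrative, not values from any particular framework:

```python
import struct

def f16(x: float) -> float:
    """Round a Python float to the nearest float16."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                  # tiny gradient (illustrative value)
assert f16(grad) == 0.0      # unscaled, it underflows to zero in float16

scale = 2.0 ** 14            # loss scale (illustrative)
scaled = f16(grad * scale)   # 1.6384e-4: inside float16's normal range
recovered = scaled / scale   # unscale in higher precision

assert abs(recovered - grad) / grad < 1e-3   # gradient survives
```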
Canonical Examples
Why 0.1 + 0.2 does not equal 0.3
The decimal number 0.1 has no exact binary representation (it is a repeating fraction in base 2, like 1/3 in base 10). In float64, $\mathrm{fl}(0.1) = 0.1000000000000000055511\ldots$, and similarly for 0.2. Their computed sum is $0.30000000000000004\ldots$, a different float64 from $\mathrm{fl}(0.3) = 0.29999999999999998\ldots$
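This is easy to check in Python; `Decimal(float)` prints the exact binary value a float actually stores:

```python
import math
from decimal import Decimal

assert 0.1 + 0.2 != 0.3
print(0.1 + 0.2)   # 0.30000000000000004

# The exact value stored for 0.1 is slightly above one tenth:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# The right way to compare floats is with a tolerance:
assert math.isclose(0.1 + 0.2, 0.3)
```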
Log-sum-exp trick
Computing $\log \sum_i e^{x_i}$ directly overflows if any $x_i$ is large. The log-sum-exp trick: let $m = \max_i x_i$, then

$$\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}.$$

Now the largest exponent is $0$, so no overflow. This is how softmax is computed in every deep learning framework.
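A minimal implementation of the trick (the same idea behind `scipy.special.logsumexp`):

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x_i))) via the max-shift trick."""
    m = max(xs)
    if math.isinf(m):   # all -inf, or an inf dominates the sum
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naively, math.exp(1000) overflows (Python raises OverflowError;
# float32/float64 array code would produce inf). The shifted version
# only ever exponentiates values <= 0:
print(logsumexp([1000.0, 1001.0, 1002.0]))   # ~1002.4076
```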
Common Confusions
Floating-point numbers are not uniformly spaced
Floating-point numbers are not uniformly spaced. Near zero they are dense; near the maximum they are sparse. The gap between $1$ and the next float32 is $2^{-23} \approx 1.2 \times 10^{-7}$; the gap between $10^8$ and the next float32 is $8$. Relative error is constant, but absolute error grows with magnitude.
Double precision does not fix all problems
Switching from float32 to float64 gives more precision but does not fix unstable algorithms. If an algorithm amplifies errors by a large condition number (ill-conditioned), float64 just delays the problem. Fix the algorithm, not the precision.
Summary
- IEEE 754: $(-1)^s \times 1.m \times 2^{e - \text{bias}}$; sign, exponent, and mantissa fields
- Machine epsilon: worst-case relative error per operation; $2^{-24} \approx 6 \times 10^{-8}$ for float32, $2^{-53} \approx 1.1 \times 10^{-16}$ for float64
- Overflow: number too large, becomes infinity
- Underflow: number too small, loses precision as a subnormal or becomes zero
- Catastrophic cancellation: subtracting nearly equal numbers destroys precision
- bfloat16 has float32's range but lower precision; preferred for training
- Always use log-space arithmetic for products of probabilities
Exercises
Problem
In float32 ($p = 24$, $\epsilon \approx 6 \times 10^{-8}$), you compute $a = 1.0000001$ and $b = 1.0000000$. How many significant digits does the computed $a - b$ have?
Problem
Explain why bfloat16 (8 exponent bits, 7 mantissa bits) is preferred over float16 (5 exponent bits, 10 mantissa bits) for training neural networks, despite having less precision.
References
Canonical:
- Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., SIAM 2002). The standard reference for the error model, $\gamma_n$, and stability analysis of basic algorithms.
- Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991).
- IEEE Standard 754-2019 for Floating-Point Arithmetic.
Current:
- Micikevicius et al., "Mixed Precision Training" (ICLR 2018). FP16 + loss scaling.
- Kalamkar et al., "A Study of BFLOAT16 for Deep Learning Training" (2019). bfloat16 exponent/range tradeoff.
- Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (CVPR 2018). The canonical int8 inference reference.
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (NeurIPS 2022). Transformer-specific int8 inference.
- Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (ICLR 2022). 8-bit optimizer state for training memory savings — different problem from int8 inference.
Next Topics
The natural next steps from floating-point arithmetic:
- Whitening and decorrelation: improving numerical conditioning of data
- Numerical linear algebra: stable algorithms for solving linear systems
Last reviewed: April 13, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
No direct prerequisites are declared; this is treated as an entry point.
Derived topics
- Numerical Stability and Conditioning (layer 1 · tier 1)
- Computer Architecture for ML (layer 2 · tier 2)
- Whitening and Decorrelation (layer 2 · tier 2)
- Mixed Precision Training (layer 3 · tier 2)