
Numerical Optimization

Floating-Point Arithmetic

How computers represent real numbers, why they get it wrong, and why ML uses float32, float16, bfloat16, and int8. IEEE 754, machine epsilon, overflow, underflow, and catastrophic cancellation.

Core · Tier 1 · Stable · Supporting · ~40 min

Why This Matters

Every number in your neural network (every weight, gradient, activation, and loss value) is a floating-point number with finite precision. When you train a model and the loss becomes NaN, when gradients explode or vanish, when two numbers that should be equal are not, floating-point arithmetic is the root cause.

Understanding floating-point is not optional for ML practitioners. It explains why we use log-sum-exp instead of summing exponentials, why bfloat16 works for training but float16 sometimes does not, and why numerical stability is a real engineering constraint.

Mental Model

A floating-point number is scientific notation for computers. Just as $6.022 \times 10^{23}$ has a significand (6.022) and an exponent (23), a floating-point number has a mantissa and an exponent, but in base 2. The mantissa gives you precision (how many significant digits), and the exponent gives you range (how large or small the number can be).

The fundamental limitation: you have a fixed number of bits for the mantissa, so most real numbers cannot be represented exactly. They get rounded to the nearest representable number.

Formal Setup and Notation

Definition

IEEE 754 Floating-Point Representation

A floating-point number in IEEE 754 format is stored as three fields:

$$(-1)^s \times 1.m \times 2^{e - \text{bias}}$$

where $s$ is the sign bit (0 for positive, 1 for negative), $m$ is the mantissa (also called significand or fraction), and $e$ is the exponent. The leading 1 in "$1.m$" is implicit (not stored), giving one extra bit of precision for free.

For float32: 1 sign bit, 8 exponent bits, 23 mantissa bits (32 total). For float64: 1 sign bit, 11 exponent bits, 52 mantissa bits (64 total).
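The bit layout can be inspected directly. A small sketch using Python's standard struct module to unpack the three fields of a float32:

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Unpack a float32 into its (sign, biased exponent, mantissa) bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 sign bit
    exponent = (bits >> 23) & 0xFF     # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF         # 23 stored mantissa bits
    return sign, exponent, mantissa

# 1.0 = (-1)^0 x 1.0 x 2^(127-127): sign 0, biased exponent 127, mantissa 0
print(float32_fields(1.0))   # (0, 127, 0)
# -2.0 = (-1)^1 x 1.0 x 2^(128-127)
print(float32_fields(-2.0))  # (1, 128, 0)
```

The biased exponent stored in the bits is $e$; the actual exponent is $e - 127$ for float32.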

Definition

Unit Roundoff

Two related constants get called "machine epsilon" depending on which community is talking. Both matter; this page uses both names where they are standard.

Unit roundoff $u$ is the worst-case relative rounding error for a single correctly-rounded operation:

$$u = 2^{-p}$$

where $p$ is the number of mantissa bits including the implicit leading 1. This is the constant in the standard $(1 + \delta)$ floating-point error model below. Numerical-analysis textbooks (Higham) use $u$ this way.

Spacing at 1, sometimes called $\varepsilon_{\text{next}}$, is the distance from $1$ to the next representable float:

$$\varepsilon_{\text{next}} = \mathrm{nextafter}(1, +\infty) - 1 = 2^{1-p}.$$

This is what C's FLT_EPSILON and NumPy's numpy.finfo(...).eps return. It is exactly twice the unit roundoff: $\varepsilon_{\text{next}} = 2u$.

| Format | $p$ | $u = 2^{-p}$ | $\varepsilon_{\text{next}} = 2^{1-p}$ |
| --- | --- | --- | --- |
| float32 | 24 | $\approx 5.96 \times 10^{-8}$ | $\approx 1.19 \times 10^{-7}$ |
| float64 | 53 | $\approx 1.11 \times 10^{-16}$ | $\approx 2.22 \times 10^{-16}$ |

When you read about "machine epsilon," check whether the context is analytic (then it usually means $u$) or programming (then it usually means $\varepsilon_{\text{next}}$). The $(1 + \delta)$ error model below uses $u$, not $\varepsilon_{\text{next}}$.

Throughout the rest of this page, $\epsilon_{\text{mach}}$ denotes $u = 2^{-p}$ (the analytic convention).
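Both constants are easy to check numerically. A quick sketch with NumPy, whose finfo(...).eps reports the spacing at 1, not the unit roundoff:

```python
import numpy as np

eps32 = np.finfo(np.float32).eps      # spacing at 1: 2**-23 ~ 1.19e-7
eps64 = np.finfo(np.float64).eps      # 2**-52 ~ 2.22e-16

# nextafter(1, up) - 1 recovers the same spacing directly
step = np.nextafter(np.float32(1.0), np.float32(2.0)) - np.float32(1.0)
print(step == eps32)                  # True

u32 = eps32 / 2                       # unit roundoff u = 2**-24 ~ 5.96e-8
print(eps32, u32)
```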

Definition

ULP (Unit in the Last Place)

The ULP of a floating-point number $x$ is the spacing between $x$ and the next representable floating-point number. For $x = 1.0$ in float32, the ULP is $2^{-23} \approx 1.19 \times 10^{-7}$, exactly $\varepsilon_{\text{next}}$ above. ULP grows with the magnitude of $x$: large numbers have larger gaps between representable values.

Core Definitions

Overflow occurs when the result of a computation exceeds the largest representable number. In float32, the maximum is approximately $3.4 \times 10^{38}$. Overflow produces infinity ($+\infty$ or $-\infty$). Computing $e^{100}$ in float32 overflows.

Underflow occurs when the result is closer to zero than the smallest representable normal number. In float32, the smallest normal number is approximately $1.18 \times 10^{-38}$ and the smallest positive subnormal is about $1.4 \times 10^{-45}$. A result smaller than the smallest normal but larger than the smallest subnormal becomes a subnormal (denormalized) float, with reduced precision; only a result smaller than the smallest subnormal underflows to zero. Computing $e^{-100} \approx 3.7 \times 10^{-44}$ in float32 lands in the subnormal range (not zero); computing $e^{-150} \approx 7 \times 10^{-66}$ underflows to zero. Some accelerated ML kernels and CPU flags (FTZ, DAZ) flush subnormals to zero for speed, so the same expression can land at zero on one device and on a subnormal on another.
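All three regimes can be reproduced in NumPy by evaluating exp in float32 (on a standard CPU build without FTZ, the subnormal survives):

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    big = np.exp(np.float32(100.0))    # e^100 ~ 2.7e43 exceeds float32 max ~3.4e38
    tiny = np.exp(np.float32(-100.0))  # ~3.7e-44: lands in the subnormal range
    gone = np.exp(np.float32(-150.0))  # ~7e-66: below the smallest subnormal

print(big)               # inf
print(tiny, tiny > 0)    # a subnormal float, still positive
print(gone)              # 0.0
```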

Catastrophic cancellation occurs when subtracting two nearly equal numbers. If $a = 1.000001$ and $b = 1.000000$ in float32, then $a - b = 0.000001$. But if $a$ and $b$ each have 7 digits of precision, $a - b$ has only 1 digit of precision. The relative error explodes.

This is why computing $\mathrm{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ is numerically unstable: if the mean is large relative to the standard deviation, you subtract two large, nearly equal numbers.

Why ML Uses Reduced Precision

| Format | Sign | Exponent | Mantissa | Total bits | Precision | Range |
| --- | --- | --- | --- | --- | --- | --- |
| float32 | 1 | 8 | 23 | 32 | ~7 digits | $\sim 10^{38}$ |
| float16 | 1 | 5 | 10 | 16 | ~3 digits | $\sim 6.5 \times 10^4$ |
| bfloat16 | 1 | 8 | 7 | 16 | ~2 digits | $\sim 10^{38}$ |
| int8 | - | - | - | 8 | 256 values | $[-128, 127]$ |

float16 has limited range (max ~65504) but decent precision. Gradients and activations in deep networks can exceed 65504, causing overflow.

bfloat16 sacrifices precision for range. It has the same exponent width (and hence dynamic range) as float32, with only 7 mantissa bits, so overflow is rare and most kernels skip the loss-scaling machinery that float16 needs. On supported hardware (TPUs, recent GPUs) bfloat16 is often the default for training. float16 is still the right path on some accelerators, with proper loss scaling and tensor-core kernels.

int8 quantization maps continuous values to 256 discrete levels. Used mainly for inference to reduce memory and compute by 4x vs float32; the canonical references are Jacob et al. (2018) for integer-only inference and Dettmers et al. (2022) LLM.int8() for transformer-specific techniques. Int8 training is an active area but not the standard path.
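As an illustration of the idea only (a minimal symmetric per-tensor scheme, not the exact method of either paper):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: map [-max|x|, +max|x|] onto [-127, 127]."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * np.float32(scale)

x = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print(q)                             # int8 codes
print(np.abs(x - x_hat).max())       # worst-case round-trip error ~ scale/2
```

Production schemes add per-channel scales, zero points for asymmetric ranges, and calibration; this sketch only shows the core map from continuous values to 256 levels.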

Main Theorems

Proposition

Fundamental Axiom of Floating-Point Arithmetic

Statement

For any real number $x$ in the representable range, the floating-point representation $\mathrm{fl}(x)$ satisfies:

$$\mathrm{fl}(x) = x(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$$

For any arithmetic operation $\circ \in \{+, -, \times, /\}$:

$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq \epsilon_{\text{mach}}$$

Each floating-point operation introduces a relative error of at most $\epsilon_{\text{mach}}$.

Intuition

Every floating-point operation is "almost right": the result is within a factor of $(1 \pm u)$ of the true answer, with $u = \epsilon_{\text{mach}} = 2^{-p}$ the unit roundoff. The problem is that errors accumulate over many operations. The standard worst-case bound for $n$ chained operations (Higham) uses

$$\gamma_n = \frac{nu}{1 - nu},$$

valid when $nu < 1$; for small $nu$ this is approximately $nu$. For random, unbiased rounding errors that behave like independent zero-mean perturbations, a heuristic average-case scale closer to $\sqrt{n}\,u$ often holds, but this is a model assumption, not a theorem; pathological inputs and structured cancellation can defeat the heuristic.
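A small experiment makes the gap between the worst-case bound and typical behavior concrete: sum many copies of 0.1 in float32 and compare the observed relative error with the $nu$ scale.

```python
import numpy as np

# Sum n copies of 0.1 in float32, rounding after every addition.
n = 100_000
term = np.float32(0.1)

acc = np.float32(0.0)
for _ in range(n):
    acc = acc + term                 # each addition rounds to float32

exact = 0.1 * n                      # float64 reference, essentially exact here
rel_err = abs(float(acc) - exact) / exact

u = 2.0 ** -24                       # float32 unit roundoff
print(rel_err)                       # observed relative error
print(n * u)                         # worst-case scale n*u ~ 6e-3
```

The observed error is bounded by roughly $\gamma_{n-1}$ and is typically well below it, consistent with the average-case heuristic.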

Proof Sketch

The nearest floating-point number to $x$ differs from $x$ by at most half a ULP. The ULP of $x$ is at most $|x| \cdot 2^{1-p}$, where $p$ is the mantissa precision including the implicit bit. So $|\mathrm{fl}(x) - x| \leq \tfrac{1}{2}|x| \cdot 2^{1-p} = |x| \cdot 2^{-p} = |x|\,\epsilon_{\text{mach}}$. Dividing by $|x|$ gives the relative error bound.

Why It Matters

This axiom is the foundation of numerical analysis. Every error analysis in scientific computing starts from this bound. It tells you that single operations are accurate, and the challenge is controlling error accumulation over long computations (like training a neural network for millions of steps).

Failure Mode

The bound assumes no overflow or underflow. In the overflow regime (result exceeds max representable), you get infinity. In the underflow regime (result is too small), you lose relative accuracy because denormalized numbers have fewer significant bits.

Worked Examples of Precision Loss

Example

Catastrophic cancellation in variance computation

Compute the variance of $x = [10000.0, 10001.0, 10002.0]$ in float32.

Naive formula: $\mathrm{Var}(x) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

$\mathbb{E}[X] = 10001.0$ (exact in float32). $\mathbb{E}[X^2] = (10000^2 + 10001^2 + 10002^2)/3 = (10^8 + 100020001 + 100040004)/3 \approx 100020001.67$.

In float32, $10000^2 = 10^8$ is exact, but $10001^2 = 100020001$ has 9 significant digits while float32 provides only 7. The stored value is approximately $1.00020000 \times 10^8$, losing the last digit. Then $\mathbb{E}[X^2] - (\mathbb{E}[X])^2 \approx 100020001.67 - 100020001.0$. Both numbers agree in their first 7 significant digits, so the subtraction produces a result with approximately 0 reliable digits. The computed variance might be 0.0, 1.0, or any value in between, depending on rounding.

The stable one-pass algorithm (Welford's method) accumulates $M_2 = \sum_i (x_i - \bar{x})^2$ incrementally, avoiding the subtraction of two large numbers. The differences $x_i - \bar{x}$ are small, so no cancellation occurs. Divide $M_2$ by $n$ (population variance) or $n - 1$ (sample variance) at the end; for $x = [10000, 10001, 10002]$ these are $2/3$ and $1$ respectively. The naive formula above targets the population form $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$.

Example

Gradient accumulation in mixed precision training

In mixed-precision training, gradients are computed in float16 or bfloat16, then accumulated in float32. Consider adding a gradient of magnitude $10^{-4}$ to a running sum that has reached $10^4$. In float16, the smallest representable change to $10^4$ is about $10^4 \times 2^{-10} \approx 9.8$. Since $10^{-4} \ll 9.8$, the addition $10^4 + 10^{-4}$ rounds back to $10^4$. The small gradient is lost entirely.

In float32, the smallest representable change to $10^4$ is about $10^4 \times 2^{-23} \approx 1.2 \times 10^{-3}$, which is still larger than $10^{-4}$, so even float32 loses this gradient. The fix: use Kahan summation or accumulate in float64 for critical quantities like loss values.
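The swallowed update, and the Kahan fix, in a few lines of NumPy:

```python
import numpy as np

big = np.float16(1e4)
small = np.float16(1e-4)
print(big + small == big)            # True: the update is swallowed in float16

def kahan_sum(values):
    """Compensated (Kahan) summation in float32."""
    s = np.float32(0.0)
    c = np.float32(0.0)              # running compensation for lost low-order bits
    for v in values:
        y = np.float32(v) - c
        t = s + y
        c = (t - s) - y              # what got rounded away in s + y
        s = t
    return s

values = [np.float32(1e4)] + [np.float32(1e-4)] * 10_000

naive = np.float32(0.0)
for v in values:
    naive = naive + v                # plain float32 accumulation loses every 1e-4

compensated = kahan_sum(values)
print(float(naive), float(compensated))   # 10000.0 vs ~10001.0
```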

This is why loss scaling is used in mixed-precision training: multiply the loss by a large constant (e.g., $2^{16}$) before backpropagation, which scales all gradients by the same factor, then divide by the scale factor when updating weights in float32. This shifts small gradients into the representable range of float16.
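A toy NumPy-only sketch of static loss scaling (the gradient value $10^{-9}$ is illustrative; real frameworks scale the loss before backprop and often adjust the scale dynamically):

```python
import numpy as np

scale = np.float32(2.0 ** 16)

true_grad = 1e-9                        # below float16's smallest subnormal (~6e-8)
half_grad = np.float16(true_grad)
print(half_grad)                        # 0.0: the gradient underflows in float16

scaled = np.float16(true_grad * scale)  # 1e-9 * 65536 ~ 6.6e-5: representable
recovered = np.float32(scaled) / scale  # unscale in float32 before the weight update
print(recovered)                        # ~1e-9 again
```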

Canonical Examples

Example

Why 0.1 + 0.2 does not equal 0.3

The decimal number 0.1 has no exact binary representation (it is a repeating fraction in base 2, like 1/3 in base 10). In float64, $\mathrm{fl}(0.1) \approx 0.1000000000000000055511\ldots$, and similarly for 0.2. Their sum is $0.3000000000000000444\ldots$, which rounds to a different float64 than $\mathrm{fl}(0.3) = 0.2999999999999999888\ldots$
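This is checkable at any Python prompt; Decimal displays the exact values of the binary floats involved:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)        # False
print(Decimal(0.1 + 0.2))      # exact value of the computed float64 sum
print(Decimal(0.3))            # exact value of fl(0.3): a different float64
print((0.1 + 0.2) - 0.3)       # 2**-54 ~ 5.55e-17, one ULP at 0.3
```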

Example

Log-sum-exp trick

Computing $\log(\sum_i e^{x_i})$ directly overflows if any $x_i$ is large. The log-sum-exp trick: let $c = \max_i x_i$, then $\log(\sum_i e^{x_i}) = c + \log(\sum_i e^{x_i - c})$. Now the largest exponent is $e^0 = 1$, so no overflow. This is how softmax is computed in every deep learning framework.
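A direct NumPy implementation of the trick, plus the max-shifted softmax it enables (a sketch of the technique, not any framework's actual kernel):

```python
import numpy as np

def logsumexp(x):
    """Stable log(sum(exp(x))): shift by the max so the largest exponent is 0."""
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore"):
    naive = np.log(np.sum(np.exp(x)))  # exp(1000) overflows even in float64
print(naive)                           # inf
print(logsumexp(x))                    # ~1002.41

def softmax(x):
    z = np.exp(x - np.max(x))          # same shift, applied elementwise
    return z / z.sum()

print(softmax(x))                      # well-defined despite huge logits
```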

Common Confusions

Watch Out

Floating-point numbers are not evenly spaced

Floating-point numbers are not uniformly spaced. Near zero they are dense; near the maximum they are sparse. The gap between $1.0$ and the next float32 is $\sim 10^{-7}$; the gap between $10^{30}$ and the next float32 is $\sim 10^{23}$. Relative error is constant, but absolute error grows with magnitude.
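NumPy's np.spacing reports the ULP at a given value, which makes the growing gaps visible:

```python
import numpy as np

print(np.spacing(np.float32(1.0)))    # 2**-23 ~ 1.19e-7
print(np.spacing(np.float32(1e30)))   # 2**76 ~ 7.6e22: a huge absolute gap

# Relative spacing stays roughly constant across magnitudes:
print(np.spacing(np.float32(1e30)) / np.float32(1e30))  # ~7.6e-8, still near u
```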

Watch Out

Double precision does not fix all problems

Switching from float32 to float64 gives more precision but does not fix unstable algorithms. If an algorithm amplifies errors by $10^{10}$ (ill-conditioned), float64 just delays the problem. Fix the algorithm, not the precision.

Summary

  • IEEE 754: $(-1)^s \times 1.m \times 2^{e - \text{bias}}$
  • Machine epsilon: worst-case relative error per operation; $\sim 10^{-7}$ for float32, $\sim 10^{-16}$ for float64
  • Overflow: number too large, becomes infinity
  • Underflow: number too small, becomes subnormal (reduced precision) or zero
  • Catastrophic cancellation: subtracting nearly equal numbers destroys precision
  • bfloat16 has float32's range but lower precision; preferred for training
  • Always use log-space arithmetic for products of probabilities

Exercises

ExerciseCore

Problem

In float32 ($\epsilon_{\text{mach}} \approx 6 \times 10^{-8}$), you compute $a = 1.0000001$ and $b = 1.0000000$. How many significant digits does $a - b = 0.0000001$ have?

ExerciseAdvanced

Problem

Explain why bfloat16 (8 exponent bits, 7 mantissa bits) is preferred over float16 (5 exponent bits, 10 mantissa bits) for training neural networks, despite having less precision.

References

Canonical:

  • Higham, Accuracy and Stability of Numerical Algorithms (2nd ed., SIAM 2002). The standard reference for the $(1 + \delta)$ error model, $\gamma_n = nu/(1 - nu)$, and stability analysis of basic algorithms.
  • Goldberg, "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (1991).
  • IEEE Standard 754-2019 for Floating-Point Arithmetic.

Current:

  • Micikevicius et al., "Mixed Precision Training" (ICLR 2018). FP16 + loss scaling.
  • Kalamkar et al., "A Study of BFLOAT16 for Deep Learning Training" (2019). bfloat16 exponent/range tradeoff.
  • Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (CVPR 2018). The canonical int8 inference reference.
  • Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (NeurIPS 2022). Transformer-specific int8 inference.
  • Dettmers et al., "8-bit Optimizers via Block-wise Quantization" (ICLR 2022). 8-bit optimizer state for training memory savings — different problem from int8 inference.

Last reviewed: April 13, 2026
