

Quantization Theory

Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.

Advanced · Tier 3 · Current · Supporting · ~55 min

Why This Matters

A 70-billion parameter model in FP32 requires 280 GB of memory. That does not fit on any single GPU. In FP16 it is 140 GB, still too large for most setups. In INT4 it is 35 GB, and now it fits on a single 40 GB A100 or even two consumer GPUs.

Quantization is not an optional optimization. It is the primary mechanism by which large models become deployable. Almost every LLM you have interacted with in production is running quantized. Understanding quantization means understanding the gap between what researchers train and what users actually run.

Mental Model

Quantization replaces high-precision floating point numbers with lower-precision integers. A weight stored as a 32-bit float has about 7 decimal digits of precision. An 8-bit integer has 256 possible values. A 4-bit integer has only 16 possible values. The question is: can you map a continuous range of weights to these few discrete values without destroying model quality?

The answer is mostly yes, with careful engineering. Weights in neural networks are remarkably robust to precision reduction because the function computed by the network depends on collective behavior of many weights, not on any single weight being exact.

Formal Setup and Notation

Definition

Uniform Quantization

Uniform quantization maps a real-valued weight $w$ to a $b$-bit integer:

$$Q(w) = \text{clamp}\left(\text{round}\left(\frac{w}{s}\right) + z, \, 0, \, 2^b - 1\right)$$

where $s$ is the scale (step size) and $z$ is the zero point (integer offset). The dequantized value is $\hat{w} = s(Q(w) - z)$. The quantization error is $|w - \hat{w}| \leq s/2$.

Definition

Symmetric vs Asymmetric Quantization

Symmetric quantization sets $z = 0$ and maps $[-\alpha, \alpha]$ to $[-2^{b-1}, 2^{b-1}-1]$ with $s = \alpha / 2^{b-1}$. It is simpler and faster but wastes part of the integer range if the weight distribution is asymmetric.

Asymmetric quantization allows $z \neq 0$ and maps $[w_{\min}, w_{\max}]$ to $[0, 2^b - 1]$ with $s = (w_{\max} - w_{\min}) / (2^b - 1)$. It uses the full integer range but requires storing the zero point.
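A minimal sketch of both schemes in PyTorch. The function names are ours, for illustration; production kernels store the integer tensors and fuse dequantization into the matmul.

```python
import torch

def asymmetric_quantize(w: torch.Tensor, bits: int = 8):
    """Map the full observed range [w_min, w_max] onto [0, 2^b - 1]."""
    qmax = 2**bits - 1
    s = (w.max() - w.min()) / qmax               # scale (step size)
    z = torch.round(-w.min() / s)                # zero point
    q = torch.clamp(torch.round(w / s) + z, 0, qmax)
    return q, s, z

def symmetric_quantize(w: torch.Tensor, bits: int = 8):
    """Map [-alpha, alpha] onto [-2^(b-1), 2^(b-1) - 1] with z = 0."""
    alpha = w.abs().max()
    s = alpha / 2**(bits - 1)
    q = torch.clamp(torch.round(w / s), -2**(bits - 1), 2**(bits - 1) - 1)
    return q, s

w = torch.randn(4096)
q, s = symmetric_quantize(w)
w_hat = s * q                                    # dequantize (z = 0)
print((w - w_hat).abs().max().item(), (s / 2).item())
# error <= s/2 everywhere except the single clamped element at +alpha,
# which maps to 2^(b-1) - 1 at a cost of one full step
```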

Definition

Per-Tensor vs Per-Channel Quantization

Per-tensor quantization uses a single scale $s$ and zero point $z$ for an entire weight matrix. Simple but coarse.

Per-channel quantization uses separate $s_c, z_c$ for each output channel (row of the weight matrix). This handles the common case where different channels have very different magnitude ranges.
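Per-channel is a one-line change over per-tensor: compute the scale along the channel axis instead of globally. A sketch, using the symmetric scheme:

```python
import torch

def per_channel_symmetric(w: torch.Tensor, bits: int = 8):
    """One scale per output channel: w is (out, in), scales are (out, 1)."""
    alpha = w.abs().amax(dim=1, keepdim=True)    # per-row max magnitude
    s = alpha / 2**(bits - 1)
    q = torch.clamp(torch.round(w / s), -2**(bits - 1), 2**(bits - 1) - 1)
    return q, s

w = torch.randn(4096, 4096)
w[0] *= 50                                       # one large-magnitude channel
q, s = per_channel_symmetric(w)
print(((w - q * s) ** 2).mean().item())
# typical rows keep full 8-bit precision; a single per-tensor scale
# would have been inflated 50x by row 0
```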

Main Theorems

Proposition

Quantization Error Bound

Statement

For uniform $b$-bit quantization of a weight matrix $W \in \mathbb{R}^{m \times n}$ with range $[-\alpha, \alpha]$ and round-to-nearest:

$$\|W - \hat{W}\|_F \leq \frac{\alpha}{2^b} \sqrt{mn}$$

The per-element error is bounded by $|w_{ij} - \hat{w}_{ij}| \leq \alpha / 2^b$. The output error for input $x$ satisfies $\|Wx - \hat{W}x\| \leq \frac{\alpha \sqrt{mn}}{2^b} \|x\|$.

Intuition

Each weight incurs at most half a step size of error. The total error scales with the number of weights and the range. Reducing from FP32 to INT8 (256 levels) means the per-weight error is roughly $\alpha/256$, which is small enough that outputs barely change. Reducing to INT4 (16 levels) gives $\alpha/16$, which requires more careful handling.

Proof Sketch

Round-to-nearest has error at most $s/2 = \alpha/2^b$ per element. The squared Frobenius norm sums $mn$ squared per-element errors: $\|W - \hat{W}\|_F^2 \leq mn \cdot (\alpha/2^b)^2$. Take the square root. For the output bound, apply Cauchy-Schwarz: $\|Wx - \hat{W}x\| \leq \|W - \hat{W}\|_F \, \|x\|$.
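The bound is easy to check numerically. A sketch, following the symmetric round-to-nearest setup above:

```python
import torch

m, n, b = 1024, 1024, 8
W = torch.empty(m, n).uniform_(-1, 1)            # range [-alpha, alpha]
alpha = W.abs().max()
s = alpha / 2**(b - 1)
W_hat = torch.clamp(torch.round(W / s), -2**(b - 1), 2**(b - 1) - 1) * s

bound = alpha / 2**b * (m * n) ** 0.5
print(torch.linalg.norm(W - W_hat, 'fro').item(), bound.item())
# observed norm is ~bound / sqrt(3): round-to-nearest error is roughly
# uniform on [-s/2, s/2], so its RMS is s / (2 * sqrt(3))
```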

Why It Matters

This bound explains the precision hierarchy: FP32 to FP16 loses almost nothing (error $\sim 10^{-4}$). FP16 to INT8 is usually safe (error $\sim 10^{-2}$). INT8 to INT4 is where quality visibly degrades unless you use smarter methods than round-to-nearest.

Failure Mode

The bound assumes uniform weight distribution. Real neural network weights are not uniform: they often have outlier channels with magnitudes 10-100x larger than typical channels. Uniform quantization wastes most of its integer range on the empty space around outliers. This is why per-channel quantization and outlier-aware methods are necessary.

Proposition

Outlier Channels Dominate Quantization Error

Statement

In transformer models, a small fraction of channels (often fewer than 1%) contain activation outliers with magnitudes 10-100x larger than typical channels. Under per-tensor quantization, these outlier channels force the scale $s$ to be large, reducing effective precision for all other channels. The quantization error is dominated by the non-outlier channels receiving fewer effective bits.

Intuition

If one channel has weights in $[-100, 100]$ and the rest are in $[-1, 1]$, per-tensor INT8 quantization uses scale $s = 200/255 \approx 0.78$. The typical weights get quantized to just 2-3 distinct values instead of using the full 256 levels. The outlier channel is fine, but everything else is destroyed.

Proof Sketch

Let $\alpha_{\max}$ be the outlier range and $\alpha_{\text{typ}}$ the typical range. The per-tensor scale is $s = 2\alpha_{\max}/(2^b - 1)$. The effective number of bits for typical channels is $\log_2(2\alpha_{\text{typ}}/s) = b + \log_2(\alpha_{\text{typ}}/\alpha_{\max})$. When $\alpha_{\max}/\alpha_{\text{typ}} = 100$, effective bits drop by $\log_2(100) \approx 6.6$, leaving fewer than 2 effective bits for INT8.
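The arithmetic from this sketch, run directly (the 100x outlier ratio is the illustrative value from above):

```python
import math

b = 8
alpha_max, alpha_typ = 100.0, 1.0
s = 2 * alpha_max / (2**b - 1)                   # per-tensor scale, ~0.78
levels_typ = 2 * alpha_typ / s                   # integer levels spanning [-1, 1]
print(levels_typ)                                # ~2.55 distinct values
print(b + math.log2(alpha_typ / alpha_max))      # ~1.36 effective bits
```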

Why It Matters

This observation motivates every modern quantization method: LLM.int8() handles outliers in FP16, GPTQ uses second-order information to compensate, and AWQ protects salient channels. Understanding the outlier problem is essential for understanding why naive quantization fails for LLMs.

Failure Mode

The outlier pattern varies across layers and models. A fixed strategy (e.g., always quantizing the top 1% in FP16) may not be optimal. Calibration data is needed to identify which channels are critical.

Quantization Methods

Definition

Post-Training Quantization (PTQ)

PTQ quantizes a pretrained model without any retraining. Steps:

  1. Choose quantization scheme (per-tensor/per-channel, symmetric/asymmetric)
  2. Run calibration data through the model to determine optimal scales
  3. Quantize weights (and optionally activations)

Advantages: fast, no training required. Disadvantage: quality degrades at low bit-widths (INT4) without compensation.
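A sketch of the calibration step for activations: run a few batches through the model, record per-tensor maxima at each Linear input, and derive scales. The hook-based approach and max-calibration heuristic are illustrative; real toolkits also offer percentile and MSE-based calibrators.

```python
import torch

@torch.no_grad()
def calibrate_scales(model: torch.nn.Module, calib_loader, bits: int = 8):
    """Record the max absolute activation seen at each Linear input."""
    amax, hooks = {}, []
    for name, mod in model.named_modules():
        if isinstance(mod, torch.nn.Linear):
            def hook(m, inputs, out, key=name):
                a = inputs[0].abs().max().item()
                amax[key] = max(amax.get(key, 0.0), a)
            hooks.append(mod.register_forward_hook(hook))
    for batch in calib_loader:                   # a few hundred samples suffice
        model(batch)
    for h in hooks:
        h.remove()
    # symmetric activation scales: alpha / 2^(b-1)
    return {k: v / 2**(bits - 1) for k, v in amax.items()}
```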

Definition

Quantization-Aware Training (QAT)

QAT simulates quantization during training. The forward pass uses quantized weights (via straight-through estimator for gradients). The model learns to be robust to quantization noise during training.

Advantages: consistently better quality than PTQ at low bit-widths. Disadvantage: requires full retraining, which is prohibitively expensive for large models.
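The core mechanism is "fake quantization" with a straight-through estimator: the forward pass sees quantized weights, the backward pass treats the rounding as identity. A minimal sketch:

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Forward: round-to-nearest. Backward: identity (straight-through)."""
    s = w.abs().max() / 2**(bits - 1)
    q = torch.clamp(torch.round(w / s), -2**(bits - 1), 2**(bits - 1) - 1) * s
    # w + (q - w).detach() equals q in the forward pass, but its gradient
    # w.r.t. w is 1, so the rounding step is invisible to backprop
    return w + (q - w).detach()

w = torch.randn(256, 256, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad.unique())                           # all ones: gradient passed through
```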

Definition

GPTQ

GPTQ (Frantar et al., 2022) quantizes weights one column at a time, using second-order (Hessian) information to optimally adjust remaining weights to compensate for quantization error. It is based on the OBQ (Optimal Brain Quantization) framework. The key point is that the Hessian $H = 2X^\top X$ comes from calibration data, making the compensation data-dependent.
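A heavily simplified sketch of the per-column loop, omitting GPTQ's blocked updates, Cholesky-based inverse, dampening, and activation ordering. `quantize_rtn` stands in for any round-to-nearest quantizer:

```python
import torch

def gptq_like(W: torch.Tensor, H: torch.Tensor, quantize_rtn):
    """Quantize columns left to right; fold each column's error into
    the not-yet-quantized columns using the inverse Hessian."""
    W = W.clone()
    Hinv = torch.linalg.inv(H)                   # GPTQ uses a Cholesky factor instead
    n = W.shape[1]
    for j in range(n):
        q = quantize_rtn(W[:, j])
        err = (W[:, j] - q) / Hinv[j, j]
        W[:, j] = q
        if j + 1 < n:
            # compensate remaining columns to minimize ||(W - Q) X||^2
            W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return W
```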

Definition

AWQ (Activation-Aware Weight Quantization)

AWQ (Lin et al., 2023) observes that a small fraction of weight channels are disproportionately important (those multiplied by large activations). It scales these salient channels up before quantization and scales them back during inference, effectively giving important channels more quantization precision.
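The transform in miniature: scale up salient input channels of the weight before quantization and fold the inverse scale into the preceding activations. A sketch assuming per-channel activation maxima are already available from calibration (AWQ itself grid-searches the exponent `alpha`):

```python
import torch

def awq_like_scale(W: torch.Tensor, act_amax: torch.Tensor, alpha: float = 0.5):
    """W: (out, in). act_amax: per-input-channel activation max from calibration."""
    s = act_amax.clamp(min=1e-5) ** alpha        # large activations -> larger scale
    s = s / (s.max() * s.min()).sqrt()           # normalize around 1
    W_scaled = W * s                             # columns with big activations grow
    # quantize W_scaled; at inference, (x / s) @ W_scaled.T == x @ W.T exactly
    return W_scaled, s
```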

Definition

GGUF Format

GGUF (GPT-Generated Unified Format) is a file format for storing quantized models, used by llama.cpp and related inference engines. It supports mixed-precision quantization (different layers at different bit-widths), metadata, and multiple quantization types (Q4_0, Q4_K_M, Q5_K_S, etc.). The K-quant variants use importance-based mixed precision within each layer.

Activation and KV-Cache Quantization

Weight-only quantization is only half of the deployment story. Activations and the KV cache also occupy GPU memory, and at long context lengths the KV cache is typically the dominant memory cost, not the weights.

Definition

SmoothQuant

SmoothQuant (Xiao et al. 2023, arXiv:2211.10438) handles activation outliers by rewriting the matmul $XW$ as

$$XW = (X \, \text{diag}(s)^{-1})(\text{diag}(s) \, W)$$

and choosing $s$ so that the activation factor $X \, \text{diag}(s)^{-1}$ has a much flatter magnitude profile per channel, at the cost of making the weight factor $\text{diag}(s) \, W$ slightly less uniform. A typical choice is $s_j = \max_i |X_{ij}|^{\alpha} / \max_i |W_{ji}|^{1-\alpha}$ with $\alpha \approx 0.5$. The transform is algebraically exact in FP16 and simply shifts where the quantization error lands.

SmoothQuant is the standard way to enable W8A8 inference (8-bit weights AND 8-bit activations) on LLMs. Without it, activation outliers force the activation scale to be so coarse that quality collapses.
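A sketch of the smoothing transform, assuming per-channel activation maxima from calibration and $W$ stored as (in, out), so channel $j$ of the activations multiplies row $j$ of $W$:

```python
import torch

def smooth(X_amax: torch.Tensor, W: torch.Tensor, alpha: float = 0.5):
    """X_amax: per-channel activation max, shape (in,). W: (in, out)."""
    W_amax = W.abs().amax(dim=1)                 # per-row weight max, shape (in,)
    s = (X_amax.pow(alpha) / W_amax.pow(1 - alpha)).clamp(min=1e-5)
    # exact rewrite: (X / s) @ (s[:, None] * W) == X @ W
    return s, s[:, None] * W
```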

Definition

KV Cache Quantization

At inference time, each attention layer caches the keys and values for every prior token. For a model with $L$ layers, $H$ heads, head dimension $d_h$, context length $T$, batch size $B$, and FP16 storage, the KV cache consumes $2 \cdot L \cdot H \cdot d_h \cdot T \cdot B \cdot 2$ bytes. For Llama-3-70B at $T = 32768$, this is tens of gigabytes per request, often exceeding the weight memory.
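Plugging the published Llama-3-70B configuration into the formula (80 layers, 8 grouped-query KV heads, head dim 128; `kv_cache_bytes` is our own helper name). With GQA, the relevant head count is the number of KV heads, not attention heads:

```python
def kv_cache_bytes(L, H_kv, d_h, T, B, bytes_per_elem=2):
    """2 tensors (K and V) x layers x KV heads x head dim x tokens x batch."""
    return 2 * L * H_kv * d_h * T * B * bytes_per_elem

per_seq = kv_cache_bytes(L=80, H_kv=8, d_h=128, T=32768, B=1)
print(per_seq / 2**30)    # 10 GiB per sequence in FP16; ~40 GiB at batch 4
# without GQA (64 KV heads) the same cache would be 80 GiB per sequence
```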

KV cache quantization stores K and V in INT8 or INT4. Two representative methods:

  • KVQuant (Hooper et al. 2024, arXiv:2401.18079) uses per-channel keys, per-token values, non-uniform quantization, and pre-RoPE quantization to hit 3-bit KV with sub-0.1 perplexity loss on Llama-2 at 128k context.
  • KIVI (Liu et al. 2024, arXiv:2402.02750) is a tuning-free 2-bit KV scheme with per-channel K and per-token V, grouped quantization for recent tokens, and a full-precision residual buffer.

These are deployment-critical for long-context inference. Serving frameworks (vLLM, TensorRT-LLM, SGLang) ship KV quantization as a first-class feature.

Modern Hardware Formats

The hardware frontier has moved past integer quantization. NVIDIA Hopper (H100, 2022) and Blackwell (B200, 2024) provide native low-precision floating point formats with dedicated tensor cores. AMD MI300 and Intel Gaudi follow the same spec.

Definition

FP8 (E4M3 and E5M2)

FP8 is an 8-bit floating point format standardized by Micikevicius et al. (2022, arXiv:2209.05433) and adopted by the OCP MX working group. Two variants trade range for precision:

  • E4M3: 1 sign + 4 exponent + 3 mantissa bits. Max magnitude $\approx 448$, used for activations and weights in the forward pass.
  • E5M2: 1 sign + 5 exponent + 2 mantissa bits. Max magnitude $\approx 57344$, matching FP16 range, used for gradients in the backward pass.
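Both maxima follow from the bit layouts. E4M3 spends its all-ones exponent-and-mantissa pattern on NaN, so the largest finite value has mantissa 110; E5M2 follows IEEE conventions, reserving the all-ones exponent for inf/NaN:

```python
# E4M3: bias 7; max finite = (1 + 6/8) * 2^(15 - 7), since S.1111.111 is NaN
print((1 + 6/8) * 2**8)       # 448.0
# E5M2: bias 15; max normal = (1 + 3/4) * 2^(30 - 15)
print((1 + 3/4) * 2**15)      # 57344.0
```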

FP8 pretraining is now production-standard on H100 and B200. DeepSeek-V3 (2024) trained the full 671B-parameter MoE in mixed FP8 end-to-end, which was the first major public demonstration of FP8 pretraining at frontier scale.

Definition

MX Formats (Microscaling)

The OCP Microscaling spec (2023) defines MXFP8, MXFP6, MXFP4, MXINT8: block-scaled formats where every 32 consecutive elements share a power-of-two scale factor stored in 8 bits, and each element uses the specified 4/6/8-bit type. The shared scale lets 4-bit and 6-bit formats handle per-tensor dynamic range that pure INT4 cannot.
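A sketch of MX-style block scaling with INT8 elements (MXINT8-like; the real spec stores the shared scale as an 8-bit power-of-two exponent, which the `ceil(log2(...))` below mimics):

```python
import torch

def mx_block_quantize(w: torch.Tensor, block: int = 32, bits: int = 8):
    """Shared power-of-two scale per block of 32 consecutive elements."""
    w = w.reshape(-1, block)
    amax = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-30)
    # smallest power of two whose integer grid covers the block's range
    scale = 2.0 ** torch.ceil(torch.log2(amax / (2**(bits - 1) - 1)))
    q = torch.clamp(torch.round(w / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q, scale

w = torch.randn(4096)
q, scale = mx_block_quantize(w)
w_hat = (q * scale).reshape(w.shape)
print((w - w_hat).abs().max())                   # bounded by half a step per block
```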

MXFP4 is Blackwell's native format for low-precision inference. NVIDIA's Transformer Engine and OCP reference implementations report that MXFP4 inference recovers most of FP16's accuracy on LLMs while doubling throughput versus FP8. As of 2026, MXFP4 is becoming the default for low-precision deployment at frontier scale.

The Low-Bit Weight-Only Frontier

Below 4 bits, uniform round-to-nearest (RTN) quantization breaks down and calibration-aware methods become necessary. Three research threads matter:

  • Incoherence processing (QuIP line). QuIP (Chee et al. 2023, arXiv:2307.13304) applies random orthogonal rotations to weights and the Hessian so outliers are spread across coordinates, which makes uniform quantization efficient. QuIP# (Tseng et al. 2024, arXiv:2402.04396) swaps random rotations for Hadamard transforms (essentially free) and replaces scalar codebooks with lattice codebooks ($E_8$ lattice at 2 bits, Golay code at 3 bits). QTIP (Tseng et al. 2024, NeurIPS, arXiv:2406.11235) replaces codebook lookup with a trellis decoder so large effective codebooks cost constant state memory. YAQA (Tseng et al. 2025, arXiv:2505.22988) is the same group's follow-up, using Kronecker-factored estimates of the full-model Hessian (not just per-layer proxy Hessians) to directly minimize end-to-end KL divergence to the unquantized model. YAQA reports roughly a 30% KL reduction versus QTIP at the same bit-width.

  • Additive and vector quantization. AQLM (Egiazarian et al. 2024, arXiv:2401.06118) represents weight blocks as sums of vectors drawn from learned codebooks and fine-tunes the codebooks on calibration data. It reports Llama-2-70B at 2 bits with under 1 perplexity point of loss, which was the 2-bit state of the art at release. GPTVQ (van Baalen et al. 2024, arXiv:2402.15319) and SqueezeLLM (Kim et al. 2024, arXiv:2306.07629) pursue related vector and non-uniform directions.

  • Calibration-free and hybrid methods. HQQ (Half-Quadratic Quantization, Badri and Shaji 2023) fits per-group zero-points via half-quadratic splitting instead of Hessian-based calibration. Applying it takes seconds to minutes even on 70B-parameter models, which is why it is the default in HuggingFace Transformers and MLX. OmniQuant (Shao et al. 2024, arXiv:2308.13137) learns equivalent transformations and weight clipping jointly.

Sub-Bit Training: BitNet

A separate line of work asks whether the full-precision model is necessary at all.

Definition

BitNet b1.58

BitNet b1.58 (Ma et al. 2024, arXiv:2402.17764) trains transformer language models from scratch with ternary weights $\{-1, 0, +1\}$ and 8-bit activations. The "1.58" refers to $\log_2 3 \approx 1.585$ bits per weight.

Reported results: at 3B parameters and beyond, BitNet b1.58 matches the perplexity and end-task accuracy of FP16 Llama at the same parameter count, while cutting energy and memory by large factors. Matrix multiplication reduces to additions and subtractions (since weights are in $\{-1, 0, +1\}$), which is hardware-friendly on accelerators without fast FP tensor cores.
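The paper's weight quantizer is an "absmean" rule: scale by the mean absolute value, round, clamp to the ternary set. A sketch of the forward-pass quantizer (training keeps FP16 master weights and a straight-through estimator, per the paper):

```python
import torch

def ternary_quantize(w: torch.Tensor):
    """BitNet b1.58 absmean rule: W ~ gamma * clamp(round(W / gamma), -1, 1)."""
    gamma = w.abs().mean().clamp(min=1e-8)
    q = torch.clamp(torch.round(w / gamma), -1, 1)
    return q, gamma

w = torch.randn(1024, 1024)
q, gamma = ternary_quantize(w)
print(q.unique())                                # tensor([-1., 0., 1.])
print((w - gamma * q).abs().mean())              # reconstruction error
```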

The epistemic significance is larger than the immediate deployment story: BitNet suggests that FP16/FP32 precision during training is needed for gradient flow, not for the representational capacity of the converged model. This reframes PTQ (GPTQ, AWQ, QuIP) as working around suboptimal training rather than fighting a fundamental precision floor.

Canonical Examples

Example

INT8 quantization of a 7B model

A 7B parameter model in FP16 requires 14 GB. In INT8 it requires 7 GB. With per-channel symmetric quantization and round-to-nearest, the perplexity increase on typical benchmarks is less than 0.5%. The memory savings are 2x with negligible quality loss. This is the easy case.

Example

INT4 quantization requires compensation

The same 7B model in INT4 (3.5 GB) with naive round-to-nearest shows 5-10% perplexity degradation. GPTQ reduces this to about 1% by using Hessian-based weight adjustment. AWQ achieves similar quality by protecting salient channels. The lesson: below INT8, naive quantization is not enough.

Common Confusions

Watch Out

Quantization is not the same as pruning

Quantization reduces the precision of every weight. Pruning removes weights entirely (sets them to zero). They are complementary: you can quantize a pruned model. Quantization preserves the model architecture while pruning changes the effective architecture.

Watch Out

Lower bits does not always mean faster inference

INT4 operations are not natively supported on all hardware. On GPUs without INT4 tensor cores, INT4 weights are dequantized to FP16 on the fly. The benefit is memory savings (fitting the model on fewer GPUs), not necessarily faster computation per token.

Summary

  • Quantization reduces precision: FP32 -> FP16 -> INT8 -> INT4
  • Per-element error bounded by half the step size
  • Outlier channels dominate error under per-tensor quantization
  • PTQ is fast but degrades at low bits; QAT is better but expensive
  • GPTQ uses Hessian information to compensate for quantization error
  • AWQ protects salient channels via activation-aware scaling
  • GGUF is the standard format for deploying quantized LLMs locally

Exercises

ExerciseCore

Problem

A weight matrix has values in $[-2, 2]$. You apply symmetric INT8 quantization. What is the step size $s$? What is the maximum per-element quantization error? If the matrix is $4096 \times 4096$, what is the bound on $\|W - \hat{W}\|_F$?

ExerciseAdvanced

Problem

Explain why GPTQ quantizes weights column by column rather than all at once. What role does the Hessian $H = X^\top X$ play in compensating for quantization error?

References

Canonical:

  • Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (2018)
  • Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022), arXiv:2208.07339

Weight-only PTQ (GPTQ / AWQ lineage):

  • Frantar and Alistarh, "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning" (OBQ, 2022), arXiv:2208.11580
  • Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (2022), arXiv:2210.17323
  • Lin et al., "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration" (2023), arXiv:2306.00978
  • Badri and Shaji, "Half-Quadratic Quantization (HQQ)" (2023), mobiusml blog/code release
  • Shao et al., "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models" (2024), arXiv:2308.13137

Activation and KV-cache quantization:

  • Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models" (2023), arXiv:2211.10438
  • Hooper et al., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization" (2024), arXiv:2401.18079
  • Liu et al., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache" (2024), arXiv:2402.02750

FP8 pretraining:

  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437. First public demonstration of full FP8 pretraining at frontier scale (671B MoE, 14.8T tokens).

Hardware formats:

  • Micikevicius et al., "FP8 Formats for Deep Learning" (2022), arXiv:2209.05433
  • Open Compute Project, "Microscaling Formats (MX) v1.0 Specification" (2023)

Low-bit weight-only frontier (QuIP / vector / additive):

  • Chee et al., "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" (2023), arXiv:2307.13304
  • Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks" (2024), arXiv:2402.04396
  • Tseng et al., "QTIP: Quantization with Trellises and Incoherence Processing" (NeurIPS 2024 Spotlight), arXiv:2406.11235. Uses trellis coded quantization to separate codebook size from bitrate via a stateful decoder.
  • Tseng et al., "YAQA: Yet Another Quantization Algorithm" (2025), arXiv:2505.22988. Uses Kronecker-factored full-model Hessian estimates to minimize end-to-end KL rather than per-layer proxy loss; reports ~30% KL reduction over QTIP at matched bit-width.
  • Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization" (2024), arXiv:2401.06118
  • Kim et al., "SqueezeLLM: Dense-and-Sparse Quantization" (2024), arXiv:2306.07629
  • van Baalen et al., "GPTVQ: The Blessing of Dimensionality for LLM Quantization" (2024), arXiv:2402.15319

Sub-bit training:

  • Ma et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (2024), arXiv:2402.17764 (BitNet b1.58)

Last reviewed: April 17, 2026
