LLM Construction
Quantization Theory
Reduce model weight precision from FP32 to FP16, INT8, or INT4. Post-training quantization, quantization-aware training, GPTQ, AWQ, and GGUF. Quantization is how large language models actually get deployed.
Prerequisites
- Eigenvalues and Eigenvectors
- Matrix Operations and Properties
- Matrix Calculus
- Softmax and Numerical Stability
- Feedforward Networks and Backpropagation
Why This Matters
A 70-billion-parameter model in FP32 requires 280 GB of memory, which does not fit on any single GPU. In FP16 it is 140 GB, still too large for most setups. In INT4 it is 35 GB, which fits on a single 40 GB A100 or even two consumer GPUs.
Quantization is not an optional optimization. It is the primary mechanism by which large models become deployable. Almost every LLM you have interacted with in production is running quantized. Understanding quantization means understanding the gap between what researchers train and what users actually run.
Mental Model
Quantization replaces high-precision floating point numbers with lower-precision integers. A weight stored as a 32-bit float has about 7 decimal digits of precision. An 8-bit integer has 256 possible values. A 4-bit integer has only 16 possible values. The question is: can you map a continuous range of weights to these few discrete values without destroying model quality?
The answer is mostly yes, with careful engineering. Weights in neural networks are remarkably robust to precision reduction because the function computed by the network depends on collective behavior of many weights, not on any single weight being exact.
Formal Setup and Notation
Uniform Quantization
Uniform quantization maps a real-valued weight $w$ to a $b$-bit integer:
$q = \mathrm{clamp}\left(\mathrm{round}(w/s) + z,\ 0,\ 2^b - 1\right)$
where $s$ is the scale (step size) and $z$ is the zero point (integer offset). The dequantized value is $\hat{w} = s \cdot (q - z)$. The quantization error is $\epsilon = w - \hat{w}$.
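The definition above can be sketched directly in NumPy; this is a minimal illustration of asymmetric uniform quantization, with the function names and the example weights chosen here for demonstration:

```python
import numpy as np

def quantize(w, bits=8):
    """Asymmetric uniform quantization: map floats to b-bit unsigned ints."""
    qmax = 2**bits - 1
    w_min, w_max = w.min(), w.max()
    s = (w_max - w_min) / qmax          # scale (step size)
    z = round(-w_min / s)               # zero point (integer offset)
    q = np.clip(np.round(w / s) + z, 0, qmax).astype(np.int32)
    return q, s, z

def dequantize(q, s, z):
    """Recover the approximate float value: w_hat = s * (q - z)."""
    return s * (q - z)

w = np.array([-0.8, -0.1, 0.0, 0.3, 1.2])
q, s, z = quantize(w, bits=8)
w_hat = dequantize(q, s, z)
# per-element error is bounded by half the step size
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-9
```

The final assertion checks the half-step error bound that the theorem section below makes precise.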
Symmetric vs Asymmetric Quantization
Symmetric quantization sets $z = 0$ and maps $w$ to $[-(2^{b-1}-1),\ 2^{b-1}-1]$ with $s = \max|w| / (2^{b-1}-1)$. It is simpler and faster but wastes part of the integer range if the weight distribution is asymmetric.
Asymmetric quantization allows $z \neq 0$ and maps $[w_{\min}, w_{\max}]$ to $[0,\ 2^b - 1]$ with $s = (w_{\max} - w_{\min}) / (2^b - 1)$. It uses the full integer range but requires storing the zero point.
Per-Tensor vs Per-Channel Quantization
Per-tensor quantization uses a single scale and zero point for an entire weight matrix. Simple but coarse.
Per-channel quantization uses a separate $(s_i, z_i)$ for each output channel (row of the weight matrix). This handles the common case where different channels have very different magnitude ranges.
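A per-channel symmetric variant is a small change to the per-tensor code; the example matrix below is constructed to show why per-row scales matter:

```python
import numpy as np

def quantize_per_channel(W, bits=8):
    """Symmetric per-channel quantization: one scale per output row."""
    qmax = 2**(bits - 1) - 1                          # e.g. 127 for INT8
    s = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row scale
    q = np.clip(np.round(W / s), -qmax, qmax).astype(np.int8)
    return q, s

W = np.array([[0.01, -0.02, 0.015],    # small-magnitude channel
              [5.0,  -3.0,   4.0  ]])  # large-magnitude channel
q, s = quantize_per_channel(W)
W_hat = q * s
# each row's error is bounded by half of that row's own step size
assert np.all(np.abs(W - W_hat) <= s / 2 + 1e-12)
```

With a single per-tensor scale, the first row would collapse to zeros; per-row scales keep its relative precision intact.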
Main Theorems
Quantization Error Bound
Statement
For uniform $b$-bit quantization of a weight matrix $W \in \mathbb{R}^{m \times n}$ with range $[w_{\min}, w_{\max}]$ and round-to-nearest:
The per-element error is bounded by $|\epsilon_{ij}| \le s/2$ with $s = (w_{\max} - w_{\min})/(2^b - 1)$, so $\|W - \hat{W}\|_F \le \sqrt{mn}\, s/2$. The output error for input $x$ satisfies $\|(W - \hat{W})x\|_2 \le \|W - \hat{W}\|_F \, \|x\|_2$.
Intuition
Each weight incurs at most half a step size of error. The total error scales with the number of weights and the range. Reducing from FP32 to INT8 (256 levels) means the per-weight error is roughly $0.2\%$ of the weight range ($s/2 \approx r/510$ for range $r$), which is small enough that outputs barely change. Reducing to INT4 (16 levels) gives roughly $3\%$ of the range ($s/2 \approx r/30$), which requires more careful handling.
Proof Sketch
Round-to-nearest has error at most $s/2$ per element. The Frobenius norm squares: $\|W - \hat{W}\|_F^2 = \sum_{ij} \epsilon_{ij}^2 \le mn\,(s/2)^2$. Take the square root. For the output bound, apply Cauchy-Schwarz: $\|(W - \hat{W})x\|_2 \le \|W - \hat{W}\|_F \, \|x\|_2$.
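Both inequalities in the sketch can be checked numerically; this is an illustrative verification on a random matrix, with the sizes and seed chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, bits = 64, 64, 8
W = rng.normal(size=(m, n))

qmax = 2**(bits - 1) - 1
s = np.abs(W).max() / qmax                 # per-tensor symmetric scale
W_hat = np.clip(np.round(W / s), -qmax, qmax) * s

fro_err = np.linalg.norm(W - W_hat)
bound = np.sqrt(m * n) * s / 2             # sqrt(mn) * s/2 from the theorem
assert fro_err <= bound

x = rng.normal(size=n)
out_err = np.linalg.norm((W - W_hat) @ x)
assert out_err <= fro_err * np.linalg.norm(x) + 1e-9   # Cauchy-Schwarz step
```

In practice the observed Frobenius error is well below the worst-case bound, since rounding errors are closer to uniform on $[-s/2, s/2]$ than to the extreme $s/2$.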
Why It Matters
This bound explains the precision hierarchy: FP32 to FP16 loses almost nothing (relative error on the order of $10^{-3}$). FP16 to INT8 is usually safe (error about $0.2\%$ of the weight range). INT8 to INT4 is where quality visibly degrades unless you use smarter methods than round-to-nearest.
Failure Mode
The bound assumes uniform weight distribution. Real neural network weights are not uniform: they often have outlier channels with magnitudes 10-100x larger than typical channels. Uniform quantization wastes most of its integer range on the empty space around outliers. This is why per-channel quantization and outlier-aware methods are necessary.
Outlier Channels Dominate Quantization Error
Statement
In transformer models, a small fraction of channels (often fewer than 1%) contain activation outliers with magnitudes 10-100x larger than typical channels. Under per-tensor quantization, these outlier channels force the scale to be large, reducing effective precision for all other channels. The quantization error is dominated by the non-outlier channels receiving fewer effective bits.
Intuition
If one channel has weights in $[-100, 100]$ and the rest are in $[-1, 1]$, per-tensor INT8 quantization uses scale $s = 100/127 \approx 0.79$. The typical weights get quantized to just 2-3 distinct values instead of using the full 256 levels. The outlier channel is fine, but everything else is destroyed.
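The collapse is easy to demonstrate; the matrix below is synthetic, with one outlier channel 100x larger than the rest:

```python
import numpy as np

rng = np.random.default_rng(0)
typical = rng.uniform(-1, 1, size=(7, 64))       # channels in [-1, 1]
outlier = rng.uniform(-100, 100, size=(1, 64))   # one outlier channel
W = np.vstack([typical, outlier])

qmax = 127
s = np.abs(W).max() / qmax        # per-tensor scale, set by the outlier
q_typical = np.round(typical / s)
# the 7 typical channels collapse to a handful of integer levels
assert len(np.unique(q_typical)) <= 5
```

Switching to per-channel scales (one scale per row) restores all 255 symmetric levels to each typical channel.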
Proof Sketch
Let $r_o$ be the outlier range and $r_t$ be the typical range. The per-tensor scale is $s = r_o / (2^{b-1} - 1)$. The effective number of bits for typical channels is $b_{\mathrm{eff}} \approx b - \log_2(r_o / r_t)$. When $r_o / r_t = 100$, effective bits drop by $\log_2 100 \approx 6.6$, leaving fewer than 2 effective bits for INT8.
Why It Matters
This observation motivates every modern quantization method: LLM.int8() handles outliers in FP16, GPTQ uses second-order information to compensate, and AWQ protects salient channels. Understanding the outlier problem is essential for understanding why naive quantization fails for LLMs.
Failure Mode
The outlier pattern varies across layers and models. A fixed strategy (e.g., always quantizing the top 1% in FP16) may not be optimal. Calibration data is needed to identify which channels are critical.
Quantization Methods
Post-Training Quantization (PTQ)
PTQ quantizes a pretrained model without any retraining. Steps:
- Choose quantization scheme (per-tensor/per-channel, symmetric/asymmetric)
- Run calibration data through the model to determine optimal scales
- Quantize weights (and optionally activations)
Advantages: fast, no training required. Disadvantage: quality degrades at low bit-widths (INT4) without compensation.
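The calibration step in PTQ can be sketched as follows. The percentile-clipping rule and all numbers here are illustrative assumptions, not a prescribed recipe; real toolkits also support entropy- and MSE-based calibration:

```python
import numpy as np

def calibrate_scale(activations, bits=8, percentile=99.9):
    """Pick an activation scale from calibration data.

    Clipping at a high percentile instead of the absolute max trades a
    little clipping error on outliers for much finer resolution overall.
    """
    qmax = 2**(bits - 1) - 1
    clip = np.percentile(np.abs(activations), percentile)
    return clip / qmax

rng = np.random.default_rng(0)
calib = rng.normal(size=10_000)      # stand-in for calibration activations
calib[:5] *= 100                     # a few activation outliers

s_minmax = np.abs(calib).max() / 127
s_pct = calibrate_scale(calib)
# percentile calibration yields a much finer step size
assert s_pct < s_minmax / 10
```

The min-max scale is dictated entirely by the five outliers; the percentile scale ignores them and gives every typical activation roughly 10x finer resolution.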
Quantization-Aware Training (QAT)
QAT simulates quantization during training. The forward pass uses quantized weights (via straight-through estimator for gradients). The model learns to be robust to quantization noise during training.
Advantages: consistently better quality than PTQ at low bit-widths. Disadvantage: requires full retraining, which is prohibitively expensive for large models.
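The straight-through estimator can be illustrated on a one-layer toy problem; this is a sketch under simplifying assumptions (a single linear output, hand-written gradients, arbitrary learning rate), not a production QAT recipe:

```python
import numpy as np

def fake_quant(w, bits=4):
    """Simulate symmetric per-tensor quantization in the forward pass."""
    qmax = 2**(bits - 1) - 1
    s = np.abs(w).max() / qmax
    return np.clip(np.round(w / s), -qmax, qmax) * s

# Toy QAT loop with the straight-through estimator (STE): the forward
# pass uses fake_quant(w); the backward pass pretends d fake_quant/dw = 1,
# so the loss gradient is applied directly to the float master weights.
rng = np.random.default_rng(0)
w = rng.normal(size=8)                 # float master weights
x = rng.normal(size=8)
y_target = 1.0

initial_err = abs(fake_quant(w) @ x - y_target)
for _ in range(200):
    y = fake_quant(w) @ x              # forward with quantized weights
    grad_w = 2 * (y - y_target) * x    # STE: gradient of the squared loss
    w -= 0.02 * grad_w                 # update the *float* weights
final_err = abs(fake_quant(w) @ x - y_target)
```

The float weights drift until their quantized images produce the right output; `final_err` typically ends up near the quantization-noise floor, far below `initial_err`.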
GPTQ
GPTQ (Frantar et al., 2022) quantizes weights one column at a time, using second-order (Hessian) information to optimally adjust remaining weights to compensate for quantization error. Based on the OBQ (Optimal Brain Quantization) framework. Key: the Hessian comes from calibration data, making the compensation data-dependent.
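The GPTQ inner loop can be sketched for a single weight row. This is a simplified rendering (no blocking, no activation reordering, with `np.linalg.inv` instead of the paper's numerically careful Cholesky-inverse pipeline); the calibration data here is synthetic:

```python
import numpy as np

def gptq_quantize_row(w, U, bits=4):
    """Quantize one row left to right, GPTQ-style: after rounding each
    element, spread its error onto the not-yet-quantized elements via
    the upper Cholesky factor U of the inverse Hessian (H^-1 = U.T @ U)."""
    w = w.copy()
    qmax = 2**(bits - 1) - 1
    s = np.abs(w).max() / qmax
    q = np.zeros_like(w)
    for i in range(len(w)):
        q[i] = np.clip(np.round(w[i] / s), -qmax, qmax) * s
        err = (w[i] - q[i]) / U[i, i]
        w[i + 1:] -= err * U[i, i + 1:]   # compensate remaining weights
    return q

rng = np.random.default_rng(0)
m, n = 512, 32
# strongly correlated calibration activations (low-rank plus noise)
X = rng.normal(size=(m, 4)) @ rng.normal(size=(4, n)) + 0.05 * rng.normal(size=(m, n))
w = rng.normal(size=n)

H = X.T @ X / m + 0.01 * np.eye(n)           # damped layer Hessian
U = np.linalg.cholesky(np.linalg.inv(H)).T   # upper factor of H^-1

q_gptq = gptq_quantize_row(w, U)
qmax = 7
s = np.abs(w).max() / qmax
q_rtn = np.clip(np.round(w / s), -qmax, qmax) * s   # round-to-nearest baseline

err_gptq = np.linalg.norm(X @ (w - q_gptq))
err_rtn = np.linalg.norm(X @ (w - q_rtn))
```

Because the calibration activations are correlated, rounding error on one weight can be absorbed by later weights, so `err_gptq` comes out well below `err_rtn` on this data.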
AWQ (Activation-Aware Weight Quantization)
AWQ (Lin et al., 2023) observes that a small fraction of weight channels are disproportionately important (those multiplied by large activations). It scales these salient channels up before quantization and scales them back during inference, effectively giving important channels more quantization precision.
GGUF Format
GGUF (GPT-Generated Unified Format) is a file format for storing quantized models, used by llama.cpp and related inference engines. It supports mixed-precision quantization (different layers at different bit-widths), metadata, and multiple quantization types (Q4_0, Q4_K_M, Q5_K_S, etc.). The K-quant variants use importance-based mixed precision within each layer.
Activation and KV-Cache Quantization
Weight-only quantization is only half of the deployment story. Activations and the KV cache also occupy GPU memory, and at long context lengths the KV cache is typically the dominant memory cost, not the weights.
SmoothQuant
SmoothQuant (Xiao et al. 2023, arXiv:2211.10438) handles activation outliers by rewriting the matmul as
$Y = XW = \left(X \operatorname{diag}(s)^{-1}\right)\left(\operatorname{diag}(s)\, W\right)$
and choosing $s$ so that the activation factor has a much flatter magnitude profile per channel, at the cost of making the weight factor slightly less uniform. A typical choice is $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ with $\alpha = 0.5$, where $X_j$ and $W_j$ are the $j$-th input channel of the activations and weights. The transform is algebraically exact in FP16 and simply shifts where the quantization error lands.
SmoothQuant is the standard way to enable W8A8 inference (8-bit weights AND 8-bit activations) on LLMs. Without it, activation outliers force the activation scale to be so coarse that quality collapses.
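The smoothing transform itself is a few lines; this sketch uses a synthetic activation matrix with one outlier channel and the $\alpha = 0.5$ balancing rule described above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))       # activations: 16 tokens, 4 channels
X[:, 2] *= 80                      # one outlier activation channel
W = rng.normal(size=(4, 4))        # weights: 4 in-channels, 4 out-channels

alpha = 0.5
s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))

X_smooth = X / s                   # divide activations per input channel
W_smooth = W * s[:, None]          # fold the same factor into the weights
# the transform is algebraically exact: identical output pre-quantization
assert np.allclose(X @ W, X_smooth @ W_smooth)

# per-channel dynamic range is now far flatter on the activation side
ratio_before = np.abs(X).max(axis=0).max() / np.abs(X).max(axis=0).min()
ratio_after = np.abs(X_smooth).max(axis=0).max() / np.abs(X_smooth).max(axis=0).min()
```

After smoothing, the activation channel ranges differ by a small factor instead of roughly 80x, so a single INT8 activation scale no longer starves the typical channels.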
KV Cache Quantization
At inference time, each attention layer caches the keys and values for every prior token. For a model with $L$ layers, $H$ KV heads, head dimension $d$, context length $T$, batch size $B$, and FP16 storage, the KV cache consumes $2 \cdot 2 \cdot L \cdot H \cdot d \cdot T \cdot B$ bytes (one factor of 2 for K and V, one for the 2 bytes per FP16 value). For Llama-3-70B at $T = 128\mathrm{k}$, this is tens of gigabytes per request, often exceeding the weight memory.
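The memory formula is worth computing concretely; the configuration below is Llama-3-70B-style (80 layers, 8 KV heads via grouped-query attention, head dimension 128), used here as an illustrative assumption:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_val=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * batch."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_val

# FP16 vs INT4 KV cache for a 128k-token request
fp16 = kv_cache_bytes(80, 8, 128, seq_len=128_000, batch=1)
int4 = kv_cache_bytes(80, 8, 128, seq_len=128_000, batch=1, bytes_per_val=0.5)
print(fp16 / 2**30)   # ≈ 39 GiB in FP16
print(int4 / 2**30)   # ≈ 10 GiB in INT4
```

At 39 GiB per 128k-token request in FP16, a single long-context user can consume as much memory as the INT4-quantized weights themselves, which is why KV quantization is treated as deployment-critical.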
KV cache quantization stores K and V in INT8 or INT4. Two representative methods:
- KVQuant (Hooper et al. 2024, arXiv:2401.18079) uses per-channel keys, per-token values, non-uniform quantization, and pre-RoPE quantization to hit 3-bit KV with sub-0.1 perplexity loss on Llama-2 at 128k context.
- KIVI (Liu et al. 2024, arXiv:2402.02750) is a tuning-free 2-bit KV scheme with per-channel K and per-token V, grouped quantization for recent tokens, and a full-precision residual buffer.
These are deployment-critical for long-context inference. Serving frameworks (vLLM, TensorRT-LLM, SGLang) ship KV quantization as a first-class feature.
Modern Hardware Formats
The hardware frontier has moved past integer quantization. NVIDIA Hopper (H100, 2022) and Blackwell (B200, 2024) provide native low-precision floating point formats with dedicated tensor cores. AMD MI300 and Intel Gaudi follow the same spec.
FP8 (E4M3 and E5M2)
FP8 is an 8-bit floating point format standardized by Micikevicius et al. (2022, arXiv:2209.05433) and adopted by the OCP MX working group. Two variants trade range for precision:
- E4M3: 1 sign + 4 exponent + 3 mantissa bits. Max magnitude 448, used for activations and weights in the forward pass.
- E5M2: 1 sign + 5 exponent + 2 mantissa bits. Max magnitude 57344, roughly matching FP16 range, used for gradients in the backward pass.
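The two max magnitudes follow directly from the bit layouts and are easy to derive by hand:

```python
# E4M3: exponent bias 7; the top exponent code (0b1111) is still usable
# except with mantissa 0b111, which encodes NaN, so the largest finite
# value has exponent 15 - 7 = 8 and mantissa 0b110 (significand 1.75).
e4m3_max = 1.75 * 2**8        # 448.0

# E5M2: IEEE-style; the top exponent code (0b11111) is reserved for
# inf/NaN, so the max exponent is 30 - 15 = 15 with mantissa 0b11 (1.75).
e5m2_max = 1.75 * 2**15       # 57344.0

assert e4m3_max == 448.0 and e5m2_max == 57344.0
```

The asymmetry is deliberate: E4M3 sacrifices inf and most NaN codes to buy one extra usable exponent value, since forward-pass tensors need precision more than range.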
FP8 pretraining is now production-standard on H100 and B200. DeepSeek-V3 (2024) trained the full 671B-parameter MoE in mixed FP8 end-to-end, which was the first major public demonstration of FP8 pretraining at frontier scale.
MX Formats (Microscaling)
The OCP Microscaling spec (2023) defines MXFP8, MXFP6, MXFP4, MXINT8: block-scaled formats where every 32 consecutive elements share a power-of-two scale factor stored in 8 bits, and each element uses the specified 4/6/8-bit type. The shared scale lets 4-bit and 6-bit formats handle per-tensor dynamic range that pure INT4 cannot.
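The block-scaling idea can be sketched as follows. This is a simplified illustration: it uses a signed-integer element grid as a stand-in for the spec's FP4/FP6 element types, and keeps only the shared power-of-two scale per 32-element block:

```python
import numpy as np

def mx_quantize(x, block=32, elem_bits=4):
    """Block-scaled quantization sketch: every `block` elements share one
    power-of-two scale; elements use a small signed integer grid here
    (the real MXFP4 element type is FP4 E2M1, approximated as INT4)."""
    qmax = 2**(elem_bits - 1) - 1
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        # smallest power-of-two scale that keeps the block on the grid
        scale = 2.0 ** np.ceil(np.log2(np.abs(blk).max() / qmax))
        out[i:i + block] = np.clip(np.round(blk / scale), -qmax, qmax) * scale
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=128)
x[64:] *= 1000        # huge dynamic range across blocks
x_hat = mx_quantize(x)
# each block's error tracks its own local range, not the global one
rel_err = np.abs(x - x_hat).max() / np.abs(x).max()
```

A plain per-tensor INT4 scheme would zero out the small half of this array entirely; the per-block scales keep its relative error comparable to that of the large half.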
MXFP4 is Blackwell's native format for low-precision inference. NVIDIA's Transformer Engine and OCP reference implementations report that MXFP4 inference recovers most of FP16's accuracy on LLMs while doubling throughput versus FP8. As of 2026, MXFP4 is becoming the default for low-precision deployment at frontier scale.
The Low-Bit Weight-Only Frontier
Below 4 bits, uniform RTN quantization breaks down and calibration-aware methods become necessary. Three research threads matter:
- Incoherence processing (QuIP line). QuIP (Chee et al. 2023, arXiv:2307.13304) applies random orthogonal rotations to weights and the Hessian so outliers are spread across coordinates, which makes uniform quantization effective. QuIP# (Tseng et al. 2024, arXiv:2402.04396) swaps random rotations for Hadamard transforms (essentially free) and replaces scalar codebooks with lattice codebooks (the $E_8$ lattice at 2 bits, Golay code at 3 bits). QTIP (Tseng et al. 2024, NeurIPS, arXiv:2406.11235) replaces codebook lookup with a trellis decoder so large effective codebooks cost constant state memory. YAQA (Tseng et al. 2025, arXiv:2505.22988) is the same group's follow-up, using Kronecker-factored estimates of the full-model Hessian (not just per-layer proxy Hessians) to directly minimize end-to-end KL to the unquantized model. YAQA reports roughly a 30% KL reduction versus QTIP at the same bit-width.
- Additive and vector quantization. AQLM (Egiazarian et al. 2024, arXiv:2401.06118) represents weight blocks as sums of vectors drawn from learned codebooks and fine-tunes the codebooks on calibration data. It reports Llama-2-70B at 2-bit with under 1 perplexity point of loss, which was 2-bit SOTA at release. GPTVQ (van Baalen et al. 2024, arXiv:2402.15319) and SqueezeLLM (Kim et al. 2024, arXiv:2306.07629) pursue related vector and non-uniform directions.
- Calibration-free and hybrid methods. HQQ (Half-Quadratic Quantization, Badri and Shaji 2023) fits per-group zero points via half-quadratic splitting instead of Hessian-based calibration. It takes seconds to minutes to apply even on 70B-parameter models, which is why it ships as a built-in option in HuggingFace Transformers and MLX. OmniQuant (Shao et al. 2024, arXiv:2308.13137) learns equivalent transformations and weight clipping jointly.
Sub-Bit Training: BitNet
A separate line of work asks whether the full-precision model is necessary at all.
BitNet b1.58
BitNet b1.58 (Ma et al. 2024, arXiv:2402.17764) trains transformer language models from scratch with ternary weights in $\{-1, 0, +1\}$ and 8-bit activations. The "1.58" refers to $\log_2 3 \approx 1.58$ bits of information per ternary weight.
Reported results: at 3B parameters and beyond, BitNet b1.58 matches the perplexity and end-task accuracy of FP16 Llama at the same parameter count, while cutting energy and memory by large factors. Matrix multiplication reduces to additions and subtractions (since weights are in $\{-1, 0, +1\}$), which is hardware-friendly on accelerators without fast FP tensor cores.
The epistemic significance is larger than the immediate deployment story: BitNet suggests that FP16/FP32 precision during training is needed for gradient flow, not for the representational capacity of the converged model. This reframes PTQ (GPTQ, AWQ, QuIP) as working around suboptimal training rather than fighting a fundamental precision floor.
Canonical Examples
INT8 quantization of a 7B model
A 7B parameter model in FP16 requires 14 GB. In INT8 it requires 7 GB. With per-channel symmetric quantization and round-to-nearest, the perplexity increase on typical benchmarks is less than 0.5%. The memory savings are 2x with negligible quality loss. This is the easy case.
INT4 quantization requires compensation
The same 7B model in INT4 (3.5 GB) with naive round-to-nearest shows 5-10% perplexity degradation. GPTQ reduces this to about 1% by using Hessian-based weight adjustment. AWQ achieves similar quality by protecting salient channels. The lesson: below INT8, naive quantization is not enough.
Common Confusions
Quantization is not the same as pruning
Quantization reduces the precision of every weight. Pruning removes weights entirely (sets them to zero). They are complementary: you can quantize a pruned model. Quantization preserves the model architecture while pruning changes the effective architecture.
Lower bits does not always mean faster inference
INT4 operations are not natively supported on all hardware. On GPUs without INT4 tensor cores, INT4 weights are dequantized to FP16 on the fly. The benefit is memory savings (fitting the model on fewer GPUs), not necessarily faster computation per token.
Summary
- Quantization reduces precision: FP32 -> FP16 -> INT8 -> INT4
- Per-element error bounded by half the step size
- Outlier channels dominate error under per-tensor quantization
- PTQ is fast but degrades at low bits; QAT is better but expensive
- GPTQ uses Hessian information to compensate for quantization error
- AWQ protects salient channels via activation-aware scaling
- GGUF is the standard format for deploying quantized LLMs locally
Exercises
Problem
A weight matrix $W$ has values in $[-1, 1]$. You apply symmetric INT8 quantization. What is the step size $s$? What is the maximum per-element quantization error? If the matrix is $4096 \times 4096$, what is the bound on $\|W - \hat{W}\|_F$?
Problem
Explain why GPTQ quantizes weights column by column rather than all at once. What role does the Hessian play in compensating for quantization error?
Related Comparisons
References
Canonical:
- Jacob et al., "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference" (2018)
- Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (2022), arXiv:2208.07339
Weight-only PTQ (GPTQ / AWQ lineage):
- Frantar and Alistarh, "Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning" (OBQ, 2022), arXiv:2208.11580
- Frantar et al., "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers" (2022), arXiv:2210.17323
- Lin et al., "AWQ: Activation-Aware Weight Quantization for LLM Compression and Acceleration" (2023), arXiv:2306.00978
- Badri and Shaji, "Half-Quadratic Quantization (HQQ)" (2023), mobiusml blog/code release
- Shao et al., "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models" (2024), arXiv:2308.13137
Activation and KV-cache quantization:
- Xiao et al., "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models" (2023), arXiv:2211.10438
- Hooper et al., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization" (2024), arXiv:2401.18079
- Liu et al., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache" (2024), arXiv:2402.02750
FP8 pretraining:
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024), arXiv:2412.19437. First public demonstration of full FP8 pretraining at frontier scale (671B MoE, 14.8T tokens).
Hardware formats:
- Micikevicius et al., "FP8 Formats for Deep Learning" (2022), arXiv:2209.05433
- Open Compute Project, "Microscaling Formats (MX) v1.0 Specification" (2023)
Low-bit weight-only frontier (QuIP / vector / additive):
- Chee et al., "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" (2023), arXiv:2307.13304
- Tseng et al., "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks" (2024), arXiv:2402.04396
- Tseng et al., "QTIP: Quantization with Trellises and Incoherence Processing" (NeurIPS 2024 Spotlight), arXiv:2406.11235. Uses trellis coded quantization to separate codebook size from bitrate via a stateful decoder.
- Tseng et al., "YAQA: Yet Another Quantization Algorithm" (2025), arXiv:2505.22988. Uses Kronecker-factored full-model Hessian estimates to minimize end-to-end KL rather than per-layer proxy loss; reports ~30% KL reduction over QTIP at matched bit-width.
- Egiazarian et al., "AQLM: Extreme Compression of Large Language Models via Additive Quantization" (2024), arXiv:2401.06118
- Kim et al., "SqueezeLLM: Dense-and-Sparse Quantization" (2024), arXiv:2306.07629
- van Baalen et al., "GPTVQ: The Blessing of Dimensionality for LLM Quantization" (2024), arXiv:2402.15319
Sub-bit training:
- Ma et al., "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" (2024), arXiv:2402.17764 (BitNet b1.58)
Next Topics
Natural extensions from quantization:
- Mixture of experts: another approach to reducing inference compute
- KV cache optimization: quantizing the key-value cache for memory savings
- KV cache: the underlying memory structure that KV quantization compresses
Last reviewed: April 17, 2026