
Paper breakdown

QTIP: Quantization with Trellises and Incoherence Processing

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa · 2024 · NeurIPS 2024

Post-training quantization that decouples codebook size from bitrate by replacing vector quantization with trellis-coded quantization. Uses random Hadamard incoherence processing to make LLM weights approximately i.i.d. Gaussian, then quantizes 256-dimensional Gaussian sequences via a Viterbi-decoded bitshift trellis. Two-bit Llama-2-70B reaches within 0.7 perplexity of FP16.

Overview

Tseng et al. (2024) introduce QTIP, a post-training quantization (PTQ) method for large language model weights that combines two ideas: trellis-coded quantization (TCQ) instead of vector quantization (VQ), and incoherence processing via random orthogonal transforms to standardize the weight distribution before coding. The motivation is simple. LLM inference is memory-bandwidth bound: a 70B-parameter model in FP16 needs 140 GB just to load weights, and decoding throughput is throttled by how fast the weight tile can stream from HBM into SRAM. Halving the precision halves the memory traffic and roughly doubles the throughput. Two-bit weights would multiply throughput by eight at the cost of $\le 2^2 = 4$ representable levels per scalar, which is far too coarse for direct scalar quantization.
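The bandwidth arithmetic behind these claims is worth writing out:

```latex
\underbrace{70\times 10^{9}\ \text{params}}_{\text{Llama-2-70B}}
\times \underbrace{2\ \text{bytes}}_{\text{FP16}} = 140\ \text{GB},
\qquad
\frac{16\ \text{bits/weight}}{2\ \text{bits/weight}} = 8\times\ \text{less weight traffic per decode step.}
```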

Vector quantization mitigates this by quantizing groups of $d$ scalars at a time to one of $2^{kd}$ codebook vectors. The codebook is a learned set of points in $\mathbb{R}^d$, chosen to fit the weight distribution. Higher $d$ means better packing density, but the codebook size $2^{kd}$ explodes; the prior state of the art (QuIP#, AQLM) is stuck at $d \le 8$ because the codebook must fit in L1 cache for fast inference. QTIP replaces the unstructured codebook with a stateful trellis decoder, so the effective codebook is the set of length-$T$ walks on a small directed graph. This decouples the bitrate (bits per scalar) from the codebook size and from the effective dimension, allowing $d > 100$ at no cache cost.

The result is that two-bit Llama-2-70B with QTIP reaches Wikitext2 perplexity $3.87$ versus FP16 at $3.12$, about half the gap of QuIP# ($4.16$) or AQLM ($3.83$, with a much slower decoder). Three- and four-bit variants are within $\sim 0.05$ of FP16 across model sizes. The decoder runs at $188$ tokens/s for 7B and $23.5$ tokens/s for 70B at batch one on an RTX 6000 Ada, strictly faster than QuIP# at the same bitrate and dramatically faster than AQLM.

Mathematical Contributions

The per-layer proxy loss

Quantizing the entire network end-to-end is intractable for a 70B model. Following the standard PTQ template (Nagel et al., 2020), QTIP minimizes the per-layer reconstruction error in input-activated form:

$$\ell(\hat W) = \mathbb{E}_x\!\left[\|(\hat W - W)x\|^2\right] = \mathrm{tr}\!\left((\hat W - W)\, H\, (\hat W - W)^\top\right)$$

where $W \in \mathbb{R}^{m\times n}$ is the FP16 weight matrix, $\hat W$ is its quantized counterpart, and $H = \mathbb{E}_x[xx^\top] \in \mathbb{R}^{n\times n}$ is the input second-moment matrix, computed from a calibration set of a few hundred forward passes. The Hessian $H$ tells the quantizer which directions matter: rounding error along eigen-directions of $H$ with large eigenvalues hurts the loss most, while error in the null space of $H$ is free.
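As a sanity check on the identity above, here is a minimal NumPy sketch (toy sizes, synthetic calibration activations; not the paper's code) confirming that the sampled loss and the Hessian trace form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, batch = 8, 16, 512

W = rng.standard_normal((m, n))                   # original weights (toy)
W_hat = W + 0.01 * rng.standard_normal((m, n))    # a quantized candidate
X = rng.standard_normal((batch, n))               # calibration activations

# Empirical proxy loss: E_x ||(W_hat - W) x||^2
direct = np.mean(np.linalg.norm((W_hat - W) @ X.T, axis=0) ** 2)

# Hessian form: tr((W_hat - W) H (W_hat - W)^T) with H = E[x x^T]
H = X.T @ X / batch
delta = W_hat - W
trace_form = np.trace(delta @ H @ delta.T)

assert np.isclose(direct, trace_form)
```

The trace form is what makes calibration cheap: $H$ is accumulated once per layer, and every candidate $\hat W$ is scored without replaying activations.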

Incoherence processing

The Hessian and the weight matrix are not Gaussian or even rotationally symmetric; they have outliers, heavy tails, and structured spikes that defeat vector-quantization shapes built for i.i.d. data. Chee et al. (2023, QuIP) proposed a fix: pre-multiply $W$ and $H$ by random orthogonal matrices to flatten the distribution.

Definition 2.1 ($\mu$-incoherence). A matrix $H \in \mathbb{R}^{n\times n}$ with eigendecomposition $H = Q\Lambda Q^\top$ is $\mu$-incoherent if $\max_{i,j} |Q_{ij}| = \max_{i,j} |e_i^\top Q e_j| \le \mu/\sqrt{n}$. A weight matrix $W \in \mathbb{R}^{m\times n}$ is $\mu$-incoherent if $\max_{i,j} |W_{ij}| \le \mu \|W\|_F / \sqrt{mn}$.

Small $\mu$ means no entry of $Q$ or $W$ is much larger than the average. The random Hadamard transform (RHT) achieves this with high probability. Let $V_k \in \mathbb{R}^{k\times k}$ be the (deterministic) Hadamard matrix and $S_k$ a length-$k$ random sign vector. Then
$$\tilde W = V_m S_m W S_n V_n^\top, \qquad \tilde H = V_n S_n H S_n V_n^\top$$
satisfies, with probability at least $1 - \delta$, $\mu_{\tilde W} = 2 \log(4mn/\delta)$. Because $V_k$ and $S_k$ are orthogonal, the per-layer loss is invariant: $\|\tilde W \tilde x - \tilde y\|^2 = \|W x - y\|^2$ on conjugated activations. Quantization is then performed in the rotated domain, where $\tilde W$ has entries that are approximately independent and Gaussian-distributed by a CLT-style argument over the random signs. This reduces the quantization problem to: code an i.i.d. Gaussian source.
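A small NumPy sketch of the RHT (Sylvester Hadamard construction, sign vectors represented as diagonal matrices; all sizes are toy choices) exhibits both properties at once: the transform is orthogonal, so norms are preserved, and an injected outlier entry gets flattened:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
m, n = 64, 128
W = rng.standard_normal((m, n))
W[3, 7] = 25.0                                 # inject a weight outlier

Sm = np.diag(rng.choice([-1.0, 1.0], m))       # random signs
Sn = np.diag(rng.choice([-1.0, 1.0], n))
Vm, Vn = hadamard(m), hadamard(n)

W_tilde = Vm @ Sm @ W @ Sn @ Vn.T              # incoherence-processed weights

# Orthogonal conjugation preserves the Frobenius norm (and the proxy loss)...
assert np.isclose(np.linalg.norm(W_tilde), np.linalg.norm(W))
# ...while the outlier's energy is spread across all mn entries
assert np.abs(W_tilde).max() < np.abs(W).max()
```

In practice the Hadamard multiply is $O(n \log n)$ via the fast Walsh-Hadamard transform rather than the dense matmul used here for clarity.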

Trellis-coded quantization

A length-$T$ scalar sequence $S = (s_1, \dots, s_T)$ is quantized to $\hat S$ by walking a directed graph $G$, the trellis, with $2^L$ nodes, each carrying outgoing edges to $2^{kV}$ neighbors and a node value in $\mathbb{R}^V$. The reconstruction $\hat S$ is the concatenated sequence of node values along the chosen walk. Each step records which of the $2^{kV}$ outgoing edges was taken, which costs $kV$ bits per length-$V$ output block, i.e. $k$ bits per scalar. The codebook is implicit: it is the set of all length-$T$ walks on $G$, which has size $2^{L} \cdot (2^{kV})^{T/V - 1}$. Storing this explicitly would be infeasible; the trellis structure stores it in $O(2^L)$ space.

For the squared-error distortion metric, the optimal walk is found by Viterbi dynamic programming. Let $\mathcal V_t(y)$ be the minimum cumulative error among all length-$t$ walks ending at node $y$. The recurrence is
$$\mathcal V_t(y) = \min_{(x,y) \in G}\, \mathcal V_{t-1}(x) + \|C_y - s_t\|^2$$
where $C_y \in \mathbb{R}^V$ is the value of node $y$. With $|G| = 2^L$ nodes and $T/V$ time steps, the total work is $O(2^L \cdot T)$: linear in $T$, not exponential. This is the central asymptotic difference from VQ, where brute-force searching a $2^{kT}$ codebook (the $T$-dimensional VQ analogue) is exponential in $T$.
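The recurrence translates directly into a few lines of NumPy. The sketch below uses $V = 1$ and a toy bitshift-style trellis with random node values; these are illustrative choices, not the paper's kernels:

```python
import numpy as np

def viterbi_quantize(seq, succ, values):
    """Minimum-MSE walk on a trellis (V = 1).
    succ[x]   -- list of successor node ids of node x
    values[y] -- scalar reconstruction value of node y
    Returns (total squared error, walk as a list of node ids)."""
    n_nodes = len(values)
    cost = (values - seq[0]) ** 2            # any start node is allowed
    back = []
    for s in seq[1:]:
        new_cost = np.full(n_nodes, np.inf)
        parent = np.zeros(n_nodes, dtype=int)
        for x in range(n_nodes):             # relax every edge (x, y)
            for y in succ[x]:
                c = cost[x] + (values[y] - s) ** 2
                if c < new_cost[y]:
                    new_cost[y], parent[y] = c, x
        back.append(parent)
        cost = new_cost
    end = int(np.argmin(cost))               # cheapest end node
    walk = [end]
    for parent in reversed(back):            # backtrack the optimal walk
        walk.append(int(parent[walk[-1]]))
    walk.reverse()
    return float(cost[end]), walk

rng = np.random.default_rng(0)
L, kV = 4, 1                                 # tiny trellis: 16 states, 2 edges out
succ = [[((x << kV) % (1 << L)) + c for c in range(1 << kV)]
        for x in range(1 << L)]
values = rng.standard_normal(1 << L)         # stand-in Gaussian node values
seq = rng.standard_normal(64)
mse, walk = viterbi_quantize(seq, succ, values)
assert all(walk[t + 1] in succ[walk[t]] for t in range(len(seq) - 1))
```

Each inner loop touches $2^L \cdot 2^{kV}$ edges per step, matching the stated $O(2^L T)$ total.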

The optimal distortion for coding an i.i.d. Gaussian source at $k$ bits per scalar is given by the rate-distortion function $D_R$, which Marcellin and Fischer (1990) showed TCQ approaches as $L \to \infty$. For $k = 2$ on a unit-variance Gaussian, the scalar Lloyd-Max quantizer has MSE $0.118$, QuIP#'s 8-dimensional $E_8$ codebook achieves $0.089$, and QTIP's 256-dimensional ($L = 16$) TCQ achieves $0.069$, against the information-theoretic lower bound $D_R = 0.063$.
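The lower bound comes straight from rate-distortion theory for a unit-variance Gaussian source:

```latex
R(D) = \tfrac{1}{2}\log_2 \frac{1}{D}
\;\Longrightarrow\;
D(R) = 2^{-2R},
\qquad
D(2) = 2^{-4} = 0.0625 \approx 0.063.
```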

The bitshift trellis

A general trellis with $|G| = 2^L$ nodes requires storing a $2^L \times 2^{kV}$ adjacency table and a $2^L \times V$ node-value codebook in inference cache, which is prohibitive for $L \gtrsim 12$. QTIP uses the bitshift trellis (Mao & Gray, 2010) to remove the adjacency table entirely. In the bitshift trellis, node $i$ has an edge to node $j$ iff
$$j = (i \cdot 2^{kV} \bmod 2^L) + c, \qquad c \in \{0, 1, \dots, 2^{kV} - 1\}.$$
Equivalently, the top $L - kV$ bits of $j$ equal the bottom $L - kV$ bits of $i$. The encoded walk is a single bitstring; advancing one step is a left shift by $kV$ bits with the new $kV$ bits packed in. Decoding the $t$-th block of $V$ weights depends only on bits $\{(t-1)kV + 1, \ldots, (t-1)kV + L\}$ of the bitstring, so blocks can be decoded in parallel with no inter-block sequentiality. There is no adjacency table to store; the structure is implicit in the bit layout.
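The parallel-decode property is easy to verify with plain integer bit twiddling. A sketch under one assumption: the bitstring is packed MSB-first ($L$ start-state bits, then $kV$ transition bits per step), which is a layout chosen for this illustration rather than the paper's exact format:

```python
import random

def decode_block(bits: int, total_bits: int, t: int, L: int, kV: int) -> int:
    """Read the L-bit trellis state for step t out of one packed bitstring."""
    offset = total_bits - (t * kV + L)       # position of the window's LSB
    return (bits >> offset) & ((1 << L) - 1)

random.seed(0)
L, kV, T = 16, 2, 10
start = random.randrange(1 << L)
steps = [random.randrange(1 << kV) for _ in range(T - 1)]

# Sequential decode: shift in kV new bits per step (the decoder's walk)
state, states = start, [start]
for c in steps:
    state = ((state << kV) & ((1 << L) - 1)) | c
    states.append(state)

# Pack the whole walk into one bitstring: start state, then each transition
bits, total = start, L
for c in steps:
    bits, total = (bits << kV) | c, total + kV

# Every step's state is recoverable from an independent L-bit window
assert all(decode_block(bits, total, t, L, kV) == states[t] for t in range(T))
```

Because each window is read independently, a GPU thread can decode its assigned block without waiting on any previous block.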

Lookup-free Gaussian codes

Even with the bitshift trellis, a vanilla TCQ still needs the size-$2^L$ codebook of node values $C_y \in \mathbb{R}^V$. For $L = 16$, a codebook of $2^{16}$ 16-bit floats occupies 128 KB, which barely fits in L1 and spills to slower cache levels in practice. QTIP instead introduces compute-based Gaussian codes that produce a pseudorandom approximate Gaussian from an $L$-bit input in $\le 4$ GPU instructions, with no lookup at all.

The "1MAD" code is the simplest. Given an $L$-bit input $x$, run one step of a linear congruential generator to scramble the bits, then sum the four bytes of the result:
$$x \leftarrow (a x + b) \bmod 2^{32}, \qquad y = (x \,\&\, \text{0xff}) + ((x \gg 8) \,\&\, \text{0xff}) + ((x \gg 16) \,\&\, \text{0xff}) + ((x \gg 24) \,\&\, \text{0xff}),$$
$$C_x = (y - 510)/147.8.$$
Each byte is a roughly uniform value in $[0, 255]$; their sum is approximately $\mathcal N(510, 147.8^2)$ by the CLT, so $C_x$ is approximately $\mathcal N(0, 1)$. The constants $a, b$ are chosen so that neighboring trellis nodes, which share many bit positions, produce near-uncorrelated outputs (the empirical correlation is shown in Figure 3 of the paper). The "3INST" code is a 3-instruction variant using float bit tricks that achieves the same effect without the LCG multiply.
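A vectorized NumPy sketch of a 1MAD-style code. The LCG constants below are the classic Numerical Recipes pair, used purely for illustration; the paper selects its own constants to decorrelate neighboring trellis states:

```python
import numpy as np

def one_mad(x: np.ndarray) -> np.ndarray:
    """Lookup-free approximate-Gaussian code in the style of QTIP's 1MAD.
    Input: array of L-bit node indices. Output: approximate N(0, 1) values."""
    a, b = np.uint64(1664525), np.uint64(1013904223)   # illustrative LCG constants
    x = (a * x.astype(np.uint64) + b) & np.uint64(0xFFFFFFFF)
    # Sum the four bytes of the scrambled 32-bit word (CLT over ~uniform bytes)
    y = ((x & 0xFF) + ((x >> 8) & 0xFF) +
         ((x >> 16) & 0xFF) + ((x >> 24) & 0xFF)).astype(np.float64)
    return (y - 510.0) / 147.8     # center/scale: sum of 4 bytes ~ N(510, 147.8^2)

# Evaluate the code on every 16-bit trellis state
codes = one_mad(np.arange(1 << 16, dtype=np.uint64))
print(codes.mean(), codes.std())   # close to 0 and 1
```

On a GPU the same four steps map to a multiply-add, shifts/masks, and a fused scale, which is why the decoder needs no memory traffic for node values.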

The MSE numbers in Table 1 of the paper: 1MAD gives $0.069$ and 3INST gives $0.069$, versus the RPTC random-permutation code (a true random Gaussian codebook stored as a lookup table) at $0.068$; the lookup-free codes match the random codebook to within rounding noise. Inference now requires zero codebook reads per weight; the decoder is pure ALU.

Tail-biting trellises

A length-$T$ encoded walk costs $kT + L - kV$ bits, because the starting state takes an extra $L - kV$ bits. To match GPU word boundaries (typically $w = 32$), QTIP requires that the start and end states share $L - kV$ bits: a tail-biting trellis. Exact tail-biting requires a Viterbi pass per candidate boundary state, which is intractable for $L \ge 12$. Algorithm 4 in the paper gives an approximation: rotate the sequence by $T/2$, run Viterbi, fix the state bits where the Viterbi-decoded start and end overlap, then re-run Viterbi on the original sequence with that state pinned. Two Viterbi calls instead of $2^{L-kV}$ candidate state pairs. Empirically the MSE is within $0.0005$ of optimal (Table 2 of the paper, for $T = 256$, $k \in \{1, 2, 3, 4\}$).
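The word-alignment arithmetic, using $k = 2$, $T = 256$, $L = 16$, $kV = 2$ as an example setting:

```latex
\text{without tail-biting: } kT + (L - kV) = 512 + 14 = 526 \text{ bits (not word-aligned)},
\qquad
\text{with tail-biting: } kT = 512 = 16 \times 32 \text{ bits.}
```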

Why this works at 2 bits where VQ stalls

The VQ quality ceiling at a fixed $kd$-bit codebook comes from finite-dimension shape distortion: at $k = 2$, $d = 8$ the codebook lives in $\mathbb{R}^8$ and cannot perfectly tile the 8-dimensional Gaussian. Adding more dimensions improves the tiling (the achievable distortion decreases monotonically toward $D_R$ as $d$ grows), but the codebook size $2^{kd}$ blows up. QTIP gets an effective $d > 100$ with a $2^L = 65536$-state trellis that does fit in cache, using trellis edges instead of unstructured codebook entries to capture the high-dimensional shape. Quantization error halves the gap to FP16 on 2-bit Llama-2-70B (Table 5 in the paper) precisely because the Gaussian source coding rate is closer to $D_R$ at high $d$.

What QTIP does not do

It is a weight-only PTQ method: activations stay in FP16 or BF16. It is orthogonal to KV-cache quantization, attention quantization, and quantization-aware training, and composes with any of them. It assumes the layer-wise proxy loss is a good objective; for layers where it is not (e.g., the embedding table or output head, which see different input distributions), the paper recommends keeping higher precision. It also does not eliminate calibration data: the Hessian $H$ is computed from a few hundred forward passes on a representative corpus.


Why It Matters Now

The systems landscape for LLM inference is bandwidth-bound. Half-precision Llama-3-405B needs 810 GB just for weights; even with 8-way tensor parallelism on H200s, the per-step cost is dominated by streaming weights from HBM into the SMs. Two-bit weights mean the same model fits in about 100 GB and the per-step bandwidth cost drops $8\times$. The question is not whether to quantize (every production deployment does) but how aggressively and at what quality cost.

Before QTIP, the answer was "down to 4 bits cleanly, 3 bits if you can tolerate $\sim 0.5$ ppl, 2 bits with a real quality cliff." QTIP closes most of the 2-bit gap. The empirical numbers on Llama-3-70B at 2 bits are now within $1$ Wikitext2 perplexity point of FP16, which is roughly the gap from one Llama-3 generation to the next. Two-bit serving is suddenly competitive.

The methodological lesson is broader. The dominant prior on weight distributions was wrong: VQ-style methods spent their representational budget shape-fitting an idiosyncratic empirical distribution, and the codebook size was the binding constraint. Pre-rotating to make the distribution i.i.d. Gaussian moves the problem into a regime where 60 years of source-coding theory directly applies. The right reference points are Marcellin-Fischer 1990 and Cover-Thomas Chapter 13, not the LLM-PTQ literature. This is a clean case of "apply the textbook result once you have set up the assumptions."

The trellis design choices — bitshift structure, compute-based codes, tail-biting approximation — are GPU-specific. The paper is honest that the engineering wins come from matching the code to the hardware: 4 ALU instructions per weight matters because that is what an H100 SM can execute alongside the matrix-multiply. Codes designed for AMD MI300, Apple Silicon, or future TPU generations would look different. The general framework — incoherence processing plus trellis coding — is hardware-agnostic; the specific 1MAD and 3INST instantiations are not.

References

Canonical:

  • Tseng, A., Sun, Q., Hou, D., & De Sa, C. (2024). "QTIP: Quantization with Trellises and Incoherence Processing." NeurIPS 2024. arXiv:2406.11235.

Direct precursors (incoherence processing):

  • Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. (2023). "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." NeurIPS 2023. arXiv:2307.13304. Introduces incoherence processing for LLM PTQ.
  • Tseng, A., Chee, J., Sun, Q., Kuleshov, V., & De Sa, C. (2024). "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." ICML 2024. arXiv:2402.04396. The 8-dimensional $E_8$-lattice predecessor.

Direct precursors (trellis coding):

  • Marcellin, M. W., & Fischer, T. R. (1990). "Trellis coded quantization of memoryless and Gauss-Markov sources." IEEE Transactions on Communications, 38(1). The original TCQ paper.
  • Mao, X., & Gray, R. M. (2010). "On a normal-form for the random permutation trellis coder." IEEE Transactions on Information Theory, 56(10). The bitshift trellis structure QTIP uses.

Competing PTQ methods:

  • Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. arXiv:2210.17323. The Hessian-aware sequential rounding baseline.
  • Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., & Han, S. (2024). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. arXiv:2306.00978. Outlier-aware scaling baseline.
  • Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). "Extreme Compression of Large Language Models via Additive Quantization." ICML 2024. arXiv:2401.06118. AQLM, the high-quality VQ baseline QTIP outperforms.

The Hessian-loss framing:

  • Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., & Blankevoort, T. (2020). "Up or Down? Adaptive Rounding for Post-Training Quantization." ICML. arXiv:2004.10568. The per-layer proxy loss QTIP minimizes.

Standard textbook:

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. Chapter 13, rate-distortion theory; the $D_R$ that QTIP approaches.


Last reviewed: May 6, 2026