
Paper breakdown

QTIP: Quantization with Trellises and Incoherence Processing

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa · 2024 · NeurIPS 2024

Post-training quantization that decouples codebook size from bitrate by replacing vector quantization with trellis-coded quantization. Uses random Hadamard incoherence processing to make LLM weights approximately i.i.d. Gaussian, then quantizes 256-dimensional Gaussian sequences via a Viterbi-decoded bitshift trellis. Two-bit Llama-2-70B reaches within 0.7 perplexity of FP16.

Overview

Tseng et al. (2024) introduce QTIP, a post-training quantization (PTQ) method for large language model weights that combines two ideas: trellis-coded quantization (TCQ) instead of vector quantization (VQ), and incoherence processing via random orthogonal transforms to standardize the weight distribution before coding. The motivation is simple. LLM inference is memory-bandwidth bound: a 70B-parameter model in FP16 needs 140 GB just to load weights, and decoding throughput is throttled by how fast the weight tile can stream from HBM into SRAM. Halving the precision halves the memory traffic and roughly doubles the throughput. Two-bit weights would multiply throughput by eight at the cost of $\le 2^2 = 4$ representable levels per scalar, which is far too coarse for direct scalar quantization.
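The bandwidth arithmetic behind these claims is worth writing out:

```latex
\underbrace{70\times 10^{9}\ \text{params}}_{\text{Llama-2-70B}}
\times \underbrace{2\ \text{bytes}}_{\text{FP16}} = 140\ \text{GB},
\qquad
\frac{16\ \text{bits/weight}}{2\ \text{bits/weight}} = 8\times\ \text{less weight traffic per decode step.}
```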

Vector quantization mitigates this by quantizing groups of $d$ scalars at a time to one of $2^{kd}$ codebook vectors. The codebook is a learned set of points in $\mathbb{R}^d$, chosen to fit the weight distribution. Higher $d$ means better packing density, but the codebook size $2^{kd}$ explodes; the prior state of the art (QuIP#, AQLM) is stuck at $d \le 8$ because the codebook must fit in L1 cache for fast inference. QTIP replaces the unstructured codebook with a stateful trellis decoder, so the effective codebook is the set of length-$T$ walks on a small directed graph. This decouples the bitrate (bits per scalar) from the codebook size and from the effective dimension, allowing $d > 100$ at no cache cost.

The result is that two-bit Llama-2-70B with QTIP reaches Wikitext2 perplexity $3.87$ versus FP16 at $3.12$, about half the gap of QuIP# ($4.16$) or AQLM ($3.83$, with a much slower decoder). Three- and four-bit variants are within $\sim 0.05$ of FP16 across model sizes. The decoder runs at $188$ tokens/s for 7B and $23.5$ tokens/s for 70B at batch one on an RTX 6000 Ada, strictly faster than QuIP# at the same bitrate and dramatically faster than AQLM.

Mathematical Contributions

The per-layer proxy loss

Quantizing the entire network end-to-end is intractable for a 70B model. Following the standard PTQ template (Nagel et al., 2020), QTIP minimizes the per-layer reconstruction error in input-activated form:

$$\ell(\hat W) = \mathbb{E}_x\!\left[\|(\hat W - W)x\|^2\right] = \mathrm{tr}\!\left((\hat W - W)\, H\, (\hat W - W)^\top\right)$$

where $W \in \mathbb{R}^{m\times n}$ is the FP16 weight matrix, $\hat W$ is its quantized counterpart, and $H = \mathbb{E}_x[xx^\top] \in \mathbb{R}^{n\times n}$ is the input second-moment matrix, computed from a calibration set of a few hundred forward passes. The Hessian $H$ tells the quantizer which directions matter: rounding error along eigen-directions of $H$ with large eigenvalues hurts the loss most, while error in the null space of $H$ is free.
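As a sanity check on the identity above, here is a minimal NumPy sketch (toy sizes, synthetic calibration activations; not the paper's code) confirming that the sampled loss and the Hessian trace form agree:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, batch = 8, 16, 512

W = rng.standard_normal((m, n))                   # original weights (toy)
W_hat = W + 0.01 * rng.standard_normal((m, n))    # a quantized candidate
X = rng.standard_normal((batch, n))               # calibration activations

# Empirical proxy loss: E_x ||(W_hat - W) x||^2
direct = np.mean(np.linalg.norm((W_hat - W) @ X.T, axis=0) ** 2)

# Hessian form: tr((W_hat - W) H (W_hat - W)^T) with H = E[x x^T]
H = X.T @ X / batch
delta = W_hat - W
trace_form = np.trace(delta @ H @ delta.T)

assert np.isclose(direct, trace_form)
```

The trace form is what makes calibration cheap: $H$ is accumulated once per layer, and every candidate $\hat W$ is scored without replaying activations.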

Incoherence processing

The Hessian and the weight matrix are not Gaussian or even rotationally symmetric; they have outliers, heavy tails, and structured spikes that defeat vector-quantization shapes built for i.i.d. data. Chee et al. (2023, QuIP) proposed a fix: pre-multiply $W$ and $H$ by random orthogonal matrices to flatten the distribution.

Definition 2.1 ($\mu$-incoherence). A matrix $H \in \mathbb{R}^{n\times n}$ with eigendecomposition $H = Q\Lambda Q^\top$ is $\mu$-incoherent if $\max_{i,j} |Q_{ij}| = \max_{i,j} |e_i^\top Q e_j| \le \mu/\sqrt{n}$. A weight matrix $W \in \mathbb{R}^{m\times n}$ is $\mu$-incoherent if $\max_{i,j} |W_{ij}| \le \mu \|W\|_F / \sqrt{mn}$.

Small $\mu$ means no entry of $Q$ or $W$ is much larger than the average. The random Hadamard transform (RHT) achieves this with high probability. Let $V_k \in \mathbb{R}^{k\times k}$ be the (deterministic) Hadamard matrix and $S_k$ a length-$k$ random sign vector. Then
$$\tilde W = V_m S_m W S_n V_n^\top, \qquad \tilde H = V_n S_n H S_n V_n^\top$$
satisfies, with probability at least $1 - \delta$, $\mu_{\tilde W} = 2 \log(4mn/\delta)$. Because $V_k$ and $S_k$ are orthogonal, the per-layer loss is invariant: $\|\tilde W \tilde x - \tilde y\|^2 = \|W x - y\|^2$ on conjugated activations. Quantization is then performed in the rotated domain, where $\tilde W$ has entries that are approximately independent and Gaussian-distributed by a CLT-style argument over the random signs. This reduces the quantization problem to: code an i.i.d. Gaussian source.
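A small NumPy sketch of the RHT (Sylvester Hadamard construction, sign vectors represented as diagonal matrices; all sizes are toy choices) exhibits both properties at once: the transform is orthogonal, so norms are preserved, and an injected outlier entry gets flattened:

```python
import numpy as np

def hadamard(n):
    """Orthonormal Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
m, n = 64, 128
W = rng.standard_normal((m, n))
W[3, 7] = 25.0                                 # inject a weight outlier

Sm = np.diag(rng.choice([-1.0, 1.0], m))       # random signs
Sn = np.diag(rng.choice([-1.0, 1.0], n))
Vm, Vn = hadamard(m), hadamard(n)

W_tilde = Vm @ Sm @ W @ Sn @ Vn.T              # incoherence-processed weights

# Orthogonal conjugation preserves the Frobenius norm (and the proxy loss)...
assert np.isclose(np.linalg.norm(W_tilde), np.linalg.norm(W))
# ...while the outlier's energy is spread across all mn entries
assert np.abs(W_tilde).max() < np.abs(W).max()
```

In practice the Hadamard multiply is $O(n \log n)$ via the fast Walsh-Hadamard transform rather than the dense matmul used here for clarity.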

Trellis-coded quantization

A length-$T$ scalar sequence $S = (s_1, \dots, s_T)$ is quantized to $\hat S$ by walking a directed graph $G$, the trellis, with $2^L$ nodes, each carrying outgoing edges to $2^{kV}$ neighbors and a node value in $\mathbb{R}^V$. The reconstruction $\hat S$ is the concatenated sequence of node values along the chosen walk. Each step records which of the $2^{kV}$ outgoing edges was taken, which costs $kV$ bits per length-$V$ output block, i.e. $k$ bits per scalar. The codebook is implicit: it is the set of all length-$T$ walks on $G$, which has size $2^{L} \cdot (2^{kV})^{T/V - 1}$. Storing this explicitly would be infeasible; the trellis structure stores it in $O(2^L)$ space.

For the squared-error distortion metric, the optimal walk is found by Viterbi dynamic programming. Let $\mathcal V_t(y)$ be the minimum cumulative error among all length-$t$ walks ending at node $y$. The recurrence is
$$\mathcal V_t(y) = \min_{(x,y) \in G}\, \mathcal V_{t-1}(x) + \|C_y - s_t\|^2$$
where $C_y \in \mathbb{R}^V$ is the value of node $y$. With $|G| = 2^L$ nodes and $T/V$ time steps, the total work is $O(2^L \cdot T)$: linear in $T$, not exponential. This is the central asymptotic difference from VQ, where brute-force searching a $2^{kT}$ codebook (the $T$-dimensional VQ analogue) is exponential in $T$.
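The recurrence translates directly into a few lines of NumPy. The sketch below uses $V = 1$ and a toy bitshift-style trellis with random node values; these are illustrative choices, not the paper's kernels:

```python
import numpy as np

def viterbi_quantize(seq, succ, values):
    """Minimum-MSE walk on a trellis (V = 1).
    succ[x]   -- list of successor node ids of node x
    values[y] -- scalar reconstruction value of node y
    Returns (total squared error, walk as a list of node ids)."""
    n_nodes = len(values)
    cost = (values - seq[0]) ** 2            # any start node is allowed
    back = []
    for s in seq[1:]:
        new_cost = np.full(n_nodes, np.inf)
        parent = np.zeros(n_nodes, dtype=int)
        for x in range(n_nodes):             # relax every edge (x, y)
            for y in succ[x]:
                c = cost[x] + (values[y] - s) ** 2
                if c < new_cost[y]:
                    new_cost[y], parent[y] = c, x
        back.append(parent)
        cost = new_cost
    end = int(np.argmin(cost))               # cheapest end node
    walk = [end]
    for parent in reversed(back):            # backtrack the optimal walk
        walk.append(int(parent[walk[-1]]))
    walk.reverse()
    return float(cost[end]), walk

rng = np.random.default_rng(0)
L, kV = 4, 1                                 # tiny trellis: 16 states, 2 edges out
succ = [[((x << kV) % (1 << L)) + c for c in range(1 << kV)]
        for x in range(1 << L)]
values = rng.standard_normal(1 << L)         # stand-in Gaussian node values
seq = rng.standard_normal(64)
mse, walk = viterbi_quantize(seq, succ, values)
assert all(walk[t + 1] in succ[walk[t]] for t in range(len(seq) - 1))
```

Each inner loop touches $2^L \cdot 2^{kV}$ edges per step, matching the stated $O(2^L T)$ total.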

The optimal distortion for coding an i.i.d. Gaussian source at $k$ bits per scalar is given by the rate-distortion function $D_R$, which Marcellin and Fischer (1990) showed TCQ approaches as $L \to \infty$. For $k = 2$ on a unit-variance Gaussian, the scalar Lloyd-Max quantizer has MSE $0.118$, QuIP#'s 8-dimensional $E_8$ codebook achieves $0.089$, and QTIP's 256-dimensional ($L = 16$) TCQ achieves $0.069$, against the information-theoretic lower bound $D_R = 0.063$.
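The lower bound comes straight from rate-distortion theory for a unit-variance Gaussian source:

```latex
R(D) = \tfrac{1}{2}\log_2 \frac{1}{D}
\;\Longrightarrow\;
D(R) = 2^{-2R},
\qquad
D(2) = 2^{-4} = 0.0625 \approx 0.063.
```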

The bitshift trellis

A general trellis with $|G| = 2^L$ nodes requires storing a $2^L \times 2^{kV}$ adjacency table and a $2^L \times V$ node-value codebook in inference cache, which is prohibitive for $L \gtrsim 12$. QTIP uses the bitshift trellis (Mao & Gray, 2010) to remove the adjacency table entirely. In the bitshift trellis, node $i$ has an edge to node $j$ iff
$$j = (i \cdot 2^{kV} \bmod 2^L) + c, \qquad c \in \{0, 1, \dots, 2^{kV} - 1\}.$$
Equivalently, the top $L - kV$ bits of $j$ equal the bottom $L - kV$ bits of $i$. The encoded walk is a single bitstring; advancing one step is a left shift by $kV$ bits with the new $kV$ bits packed in. Decoding the $t$-th block of $V$ weights depends only on bits $\{(t-1)kV + 1, \ldots, (t-1)kV + L\}$ of the bitstring, so blocks can be decoded in parallel with no inter-block sequentiality. There is no adjacency table to store; the structure is implicit in the bit layout.
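The parallel-decode property is easy to verify with plain integer bit twiddling. A sketch under one assumption: the bitstring is packed MSB-first ($L$ start-state bits, then $kV$ transition bits per step), which is a layout chosen for this illustration rather than the paper's exact format:

```python
import random

def decode_block(bits: int, total_bits: int, t: int, L: int, kV: int) -> int:
    """Read the L-bit trellis state for step t out of one packed bitstring."""
    offset = total_bits - (t * kV + L)       # position of the window's LSB
    return (bits >> offset) & ((1 << L) - 1)

random.seed(0)
L, kV, T = 16, 2, 10
start = random.randrange(1 << L)
steps = [random.randrange(1 << kV) for _ in range(T - 1)]

# Sequential decode: shift in kV new bits per step (the decoder's walk)
state, states = start, [start]
for c in steps:
    state = ((state << kV) & ((1 << L) - 1)) | c
    states.append(state)

# Pack the whole walk into one bitstring: start state, then each transition
bits, total = start, L
for c in steps:
    bits, total = (bits << kV) | c, total + kV

# Every step's state is recoverable from an independent L-bit window
assert all(decode_block(bits, total, t, L, kV) == states[t] for t in range(T))
```

Because each window is read independently, a GPU thread can decode its assigned block without waiting on any previous block.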

Lookup-free Gaussian codes

Even with the bitshift trellis, a vanilla TCQ still needs the size-$2^L$ codebook of node values $C_y \in \mathbb{R}^V$. For $L = 16$, a codebook of $2^{16}$ 16-bit floats occupies 128 KB, which barely fits in L1 and spills to slower cache levels in practice. QTIP instead introduces compute-based Gaussian codes that produce a pseudorandom approximate Gaussian from an $L$-bit input in $\le 4$ GPU instructions, with no lookup at all.

The "1MAD" code is the simplest. Given an $L$-bit input $x$, run one step of a linear congruential generator to scramble the bits, then sum the four bytes of the result:
$$x \leftarrow (a x + b) \bmod 2^{32}, \qquad y = (x \,\&\, \text{0xff}) + ((x \gg 8) \,\&\, \text{0xff}) + ((x \gg 16) \,\&\, \text{0xff}) + ((x \gg 24) \,\&\, \text{0xff}),$$
$$C_x = (y - 510)/147.8.$$
Each byte is a roughly uniform value in $[0, 255]$; their sum is approximately $\mathcal N(510, 147.8^2)$ by the CLT, so $C_x$ is approximately $\mathcal N(0, 1)$. The constants $a, b$ are chosen so that neighboring trellis nodes, which share many bit positions, produce near-uncorrelated outputs (the empirical correlation is shown in Figure 3 of the paper). The "3INST" code is a 3-instruction variant using float bit tricks that achieves the same effect without the LCG multiply.
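A vectorized NumPy sketch of a 1MAD-style code. The LCG constants below are the classic Numerical Recipes pair, used purely for illustration; the paper selects its own constants to decorrelate neighboring trellis states:

```python
import numpy as np

def one_mad(x: np.ndarray) -> np.ndarray:
    """Lookup-free approximate-Gaussian code in the style of QTIP's 1MAD.
    Input: array of L-bit node indices. Output: approximate N(0, 1) values."""
    a, b = np.uint64(1664525), np.uint64(1013904223)   # illustrative LCG constants
    x = (a * x.astype(np.uint64) + b) & np.uint64(0xFFFFFFFF)
    # Sum the four bytes of the scrambled 32-bit word (CLT over ~uniform bytes)
    y = ((x & 0xFF) + ((x >> 8) & 0xFF) +
         ((x >> 16) & 0xFF) + ((x >> 24) & 0xFF)).astype(np.float64)
    return (y - 510.0) / 147.8     # center/scale: sum of 4 bytes ~ N(510, 147.8^2)

# Evaluate the code on every 16-bit trellis state
codes = one_mad(np.arange(1 << 16, dtype=np.uint64))
print(codes.mean(), codes.std())   # close to 0 and 1
```

On a GPU the same four steps map to a multiply-add, shifts/masks, and a fused scale, which is why the decoder needs no memory traffic for node values.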

The MSE numbers in Table 1 of the paper: 1MAD gives $0.069$ and 3INST gives $0.069$, versus the RPTC random-permutation code (a true random Gaussian codebook stored as a lookup table) at $0.068$; the lookup-free codes match the random codebook to within rounding noise. Inference now requires zero codebook reads per weight; the decoder is pure ALU.

Tail-biting trellises

A length-$T$ encoded walk costs $kT + L - kV$ bits, because the starting state takes an extra $L - kV$ bits. To match GPU word boundaries (typically $w = 32$), QTIP requires that the start and end states share $L - kV$ bits: a tail-biting trellis. Exact tail-biting requires a Viterbi pass per candidate boundary state, which is intractable for $L \ge 12$. Algorithm 4 in the paper gives an approximation: rotate the sequence by $T/2$, run Viterbi, fix the state bits where the Viterbi-decoded start and end overlap, then re-run Viterbi on the original sequence with that state pinned. Two Viterbi calls instead of $2^{L-kV}$ candidate state pairs. Empirically the MSE is within $0.0005$ of optimal (Table 2 of the paper, for $T = 256$, $k \in \{1, 2, 3, 4\}$).
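The word-alignment arithmetic, using $k = 2$, $T = 256$, $L = 16$, $kV = 2$ as an example setting:

```latex
\text{without tail-biting: } kT + (L - kV) = 512 + 14 = 526 \text{ bits (not word-aligned)},
\qquad
\text{with tail-biting: } kT = 512 = 16 \times 32 \text{ bits.}
```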

Why this works at 2 bits where VQ stalls

The VQ quality ceiling at a fixed $kd$-bit codebook comes from finite-dimension shape distortion: at $k = 2$, $d = 8$ the codebook lives in $\mathbb{R}^8$ and cannot perfectly tile the 8-dimensional Gaussian. Adding more dimensions improves the tiling (the achievable distortion decreases monotonically toward $D_R$ as $d$ grows), but the codebook size $2^{kd}$ blows up. QTIP gets an effective $d > 100$ with a $2^L = 65536$-state trellis that does fit in cache, using trellis edges instead of unstructured codebook entries to capture the high-dimensional shape. Quantization error halves the gap to FP16 on 2-bit Llama-2-70B (Table 5 in the paper) precisely because the Gaussian source coding rate is closer to $D_R$ at high $d$.

What QTIP does not do

It is a weight-only PTQ method: activations stay in FP16 or BF16. It is orthogonal to KV-cache quantization, attention quantization, and quantization-aware training, and composes with any of them. It assumes the layer-wise proxy loss is a good objective; for layers where it is not (e.g., the embedding table or output head, which see different input distributions), the paper recommends keeping higher precision. It also does not eliminate calibration data: the Hessian $H$ is computed from a few hundred forward passes on a representative corpus.


Why It Matters Now

The systems landscape for LLM inference is bandwidth-bound. Half-precision Llama-3-405B needs 810 GB just for weights; even with 8-way tensor parallelism on H200s, the per-step cost is dominated by streaming weights from HBM into the SMs. Two-bit weights mean the same model fits in about 100 GB and the per-step bandwidth cost drops $8\times$. The question is not whether to quantize (every production deployment does) but how aggressively and at what quality cost.

Before QTIP, the answer was "down to 4 bits cleanly, 3 bits if you can tolerate $\sim 0.5$ ppl, 2 bits with a real quality cliff." QTIP closes most of the 2-bit gap. The empirical numbers on Llama-3-70B at 2 bits are now within $1$ Wikitext2 perplexity point of FP16, which is roughly the gap from one Llama-3 generation to the next. Two-bit serving is suddenly competitive.

The methodological lesson is broader. The dominant prior on weight distributions was wrong: VQ-style methods spent their representational budget shape-fitting an idiosyncratic empirical distribution, and the codebook size was the binding constraint. Pre-rotating to make the distribution i.i.d. Gaussian moves the problem into a regime where 60 years of source-coding theory directly applies. The right reference points are Marcellin-Fischer 1990 and Cover-Thomas Chapter 13, not the LLM-PTQ literature. This is a clean case of "apply the textbook result once you have set up the assumptions."

The trellis design choices — bitshift structure, compute-based codes, tail-biting approximation — are GPU-specific. The paper is honest that the engineering wins come from matching the code to the hardware: 4 ALU instructions per weight matters because that is what an H100 SM can execute alongside the matrix-multiply. Codes designed for AMD MI300, Apple Silicon, or future TPU generations would look different. The general framework — incoherence processing plus trellis coding — is hardware-agnostic; the specific 1MAD and 3INST instantiations are not.

References

Canonical:

  • Tseng, A., Sun, Q., Hou, D., & De Sa, C. (2024). "QTIP: Quantization with Trellises and Incoherence Processing." NeurIPS 2024. arXiv:2406.11235.

Direct precursors (incoherence processing):

  • Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. (2023). "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." NeurIPS 2023. arXiv:2307.13304. Introduces incoherence processing for LLM PTQ.
  • Tseng, A., Chee, J., Sun, Q., Kuleshov, V., & De Sa, C. (2024). "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." ICML 2024. arXiv:2402.04396. The 8-dimensional $E_8$-lattice predecessor.

Direct precursors (trellis coding):

  • Marcellin, M. W., & Fischer, T. R. (1990). "Trellis coded quantization of memoryless and Gauss-Markov sources." IEEE Transactions on Communications, 38(1). The original TCQ paper.
  • Mao, X., & Gray, R. M. (2010). "On a normal-form for the random permutation trellis coder." IEEE Transactions on Information Theory, 56(10). The bitshift trellis structure QTIP uses.

Competing PTQ methods:

  • Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023. arXiv:2210.17323. The Hessian-aware sequential rounding baseline.
  • Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., Gan, C., & Han, S. (2024). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024. arXiv:2306.00978. Outlier-aware scaling baseline.
  • Egiazarian, V., Panferov, A., Kuznedelev, D., Frantar, E., Babenko, A., & Alistarh, D. (2024). "Extreme Compression of Large Language Models via Additive Quantization." ICML 2024. arXiv:2401.06118. AQLM, the high-quality VQ baseline QTIP outperforms.

The Hessian-loss framing:

  • Nagel, M., Amjad, R. A., van Baalen, M., Louizos, C., & Blankevoort, T. (2020). "Up or Down? Adaptive Rounding for Post-Training Quantization." ICML. arXiv:2004.10568. The per-layer proxy loss QTIP minimizes.

Standard textbook:

  • Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley. Chapter 13, rate-distortion theory; the $D_R$ that QTIP approaches.


Last reviewed: May 6, 2026