LLM Construction

Efficient Transformers Survey

Sub-quadratic attention variants (linear attention, Linformer, Performer, Longformer, BigBird) and why FlashAttention, a hardware-aware exact method, made most of them unnecessary in practice.

Advanced · Tier 2 · Frontier · Supporting · ~55 min

Why This Matters

Standard self-attention computes all n^2 pairwise interactions for a sequence of length n, costing O(n^2 d) time and O(n^2) memory. For long sequences (documents, genomes, high-resolution images), this quadratic cost is prohibitive.

Between 2019 and 2022, dozens of papers proposed sub-quadratic attention variants. Most are now rarely used. FlashAttention (2022) showed that the bottleneck was memory bandwidth, not FLOPs, and that exact attention with IO-aware tiling is faster in practice than most approximate methods up to sequences of length 8K-16K. Understanding why this happened teaches more about systems-level thinking than about any single algorithm.

The Quadratic Bottleneck

For queries Q, keys K, and values V, all in \mathbb{R}^{n \times d}:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V

The matrix QK^\top \in \mathbb{R}^{n \times n} has n^2 entries. Storing it costs O(n^2) memory. For n = 100{,}000 and d = 128, the attention matrix alone requires 40 GB in float32. The question: can we avoid materializing this matrix?
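As a sanity check on the numbers above, a minimal back-of-envelope calculation (illustrative only; the function name is ours):

```python
def attn_matrix_bytes(n: int, dtype_bytes: int = 4) -> int:
    """Bytes needed to store one n x n attention score matrix."""
    return n * n * dtype_bytes

# n = 100,000 in float32: 4e10 bytes = 40 GB, for a single head
# of a single layer -- before any activations or gradients.
print(attn_matrix_bytes(100_000) / 1e9)  # 40.0
```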

Linear Attention

Proposition

Linear Attention via Kernel Decomposition

Statement

Replace the softmax kernel \kappa(q, k) = \exp(q^\top k / \sqrt{d}) with a feature map \phi: \mathbb{R}^d \to \mathbb{R}^m such that \kappa(q, k) \approx \phi(q)^\top \phi(k). Then:

\text{Attention}(Q, K, V)_i = \frac{\phi(Q_i)^\top \sum_j \phi(K_j) V_j^\top}{\phi(Q_i)^\top \sum_j \phi(K_j)}

The sums S = \sum_j \phi(K_j) V_j^\top \in \mathbb{R}^{m \times d} and z = \sum_j \phi(K_j) \in \mathbb{R}^m can be precomputed in O(nmd) time. Each attention output then costs O(md), giving total cost O(nmd), which is linear in n for fixed m.

Intuition

Instead of computing the full n \times n attention matrix and then multiplying by V, change the order of operations. First aggregate key-value pairs into a compact summary (the matrix S), then query this summary. This is the associativity trick: (QK^\top)V = Q(K^\top V), up to normalization.
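The reordering can be sketched in a few lines of NumPy (a minimal sketch, using the elu(x)+1 feature map from Katharopoulos et al.; no n x n matrix is ever formed):

```python
import numpy as np

def elu_plus_one(X):
    """elu(x) + 1: a simple positive-valued feature map."""
    return np.where(X > 0, X + 1.0, np.exp(X))

def linear_attention(Q, K, V, phi=elu_plus_one):
    """Non-causal linear attention via the kernel decomposition.

    phi maps (n, d) -> (n, m). Total cost is O(n*m*d).
    """
    Qf, Kf = phi(Q), phi(K)          # (n, m)
    S = Kf.T @ V                     # (m, d): sum_j phi(K_j) V_j^T
    z = Kf.sum(axis=0)               # (m,):   sum_j phi(K_j)
    return (Qf @ S) / (Qf @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
out = linear_attention(Q, K, V)      # shape (16, 8)
```

The output matches the explicit kernel-attention formula sum_j phi(Q_i)^T phi(K_j) V_j / sum_j phi(Q_i)^T phi(K_j), just computed in the cheap order.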

Proof Sketch

Write \text{Attn}_i = \sum_j \kappa(Q_i, K_j) V_j / \sum_j \kappa(Q_i, K_j). Substitute \kappa(Q_i, K_j) = \phi(Q_i)^\top \phi(K_j) and factor \phi(Q_i)^\top out of both sums.

Why It Matters

Linear attention reduces the complexity from O(n^2 d) to O(nmd). For m \ll n, this is a large speedup on very long sequences. The idea also enables recurrent computation: process tokens one by one, updating S and z incrementally, with constant memory per step.
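The recurrent view can be sketched as a scan over the sequence (an illustrative sketch; real implementations fuse this loop into a kernel and often add gating or decay):

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi):
    """Causal linear attention run as an RNN over the sequence.

    Token t attends to positions <= t. The state (S, z) has size
    O(m*d), independent of sequence length.
    """
    Qf, Kf = phi(Q), phi(K)
    n, d = V.shape
    m = Qf.shape[1]
    S = np.zeros((m, d))             # running sum_j phi(K_j) V_j^T
    z = np.zeros(m)                  # running sum_j phi(K_j)
    out = np.empty((n, d))
    for t in range(n):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

phi = lambda X: np.where(X > 0, X + 1.0, np.exp(X))   # elu(x)+1
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((12, 4)) for _ in range(3))
out = causal_linear_attention(Q, K, V, phi)
```

This is exactly the "Transformers are RNNs" observation: the same model can run in parallel over the sequence at training time and recurrently at inference time.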

Failure Mode

The approximation quality depends on \phi. No known feature map perfectly approximates softmax attention. The approximation error degrades attention's ability to do sharp, sparse lookups (e.g., copying a specific token). In practice, linear attention models underperform standard attention on tasks requiring precise retrieval from context.

Sparse Attention Methods

Rather than approximating the softmax, sparse methods compute exact attention on a subset of the n^2 pairs.

Longformer (Beltagy et al., 2020): Each token attends to a fixed-size local window (e.g., 512 tokens around it) plus a small set of global tokens. Cost is O(nw), where w is the window size. Good for document-level tasks where local context dominates.
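The local-window pattern can be sketched with an explicit mask (for clarity only: as written this is still O(n^2), and global tokens are omitted; a real Longformer kernel computes only the O(nw) in-window scores):

```python
import numpy as np

def sliding_window_attention(Q, K, V, w):
    """Softmax attention where token i attends only to |i - j| <= w."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w
    scores = np.where(mask, scores, -np.inf)   # out-of-window -> weight 0
    P = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V
```

With w = 0 each token attends only to itself (the output is just V); with w >= n this reduces to full softmax attention.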

BigBird (Zaheer et al., 2020): Combines local window attention, global tokens, and random attention edges. Proved that this sparse pattern is a universal approximator of sequence-to-sequence functions and is Turing complete, matching the theoretical expressiveness of full attention.

Block-sparse attention: Partition tokens into blocks and attend only within blocks (or within a structured pattern of blocks). Efficient on GPUs because block operations map well to hardware. This is not the same as FlashAttention: FlashAttention computes the full, dense exact softmax attention and only uses block-wise tiling of the same arithmetic to fit the inner products in fast on-chip SRAM. Block-sparse attention drops pairs; FlashAttention drops nothing.

Low-Rank Approximations

Linformer (Wang et al., 2020): Project keys and values along the sequence dimension down to length k \ll n using learned projection matrices E, F \in \mathbb{R}^{k \times n}:

\text{Attention}(Q, EK, FV) = \text{softmax}\!\left(\frac{Q(EK)^\top}{\sqrt{d}}\right) FV

Cost is O(nkd) instead of O(n^2 d). The claim: the attention matrix is approximately low-rank for most practical inputs, so projecting to rank k loses little information.
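The mechanics fit in a few lines (a minimal sketch; in the paper E and F are learned, here they would be supplied by the caller):

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer: compress the sequence axis of K and V to length k.

    E, F: (k, n) projection matrices. The score matrix is (n, k),
    so cost is O(n*k*d) and the n x n matrix is never formed.
    """
    d = Q.shape[1]
    Kp, Vp = E @ K, F @ V                   # (k, d)
    S = Q @ Kp.T / np.sqrt(d)               # (n, k)
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ Vp
```

Setting k = n with E = F = I recovers standard attention exactly, which makes the approximation explicit: everything lost is lost in the projection.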

Performer (Choromanski et al., 2021): Uses positive random features (the FAVOR+ scheme) to give an unbiased estimate of the exponential dot-product kernel that defines softmax attention:

\exp(q^\top k / \sqrt{d}) \approx \phi(q)^\top \phi(k)

where \phi(x) = \frac{1}{\sqrt{m}} \exp(-\|x\|^2/2) \, [\exp(\omega_1^\top x), \ldots, \exp(\omega_m^\top x)] with random \omega_i \sim \mathcal{N}(0, I_d). The 1/\sqrt{d} factor that appears in standard scaled dot-product attention is absorbed into the query/key pre-normalization, not into the covariance of \omega_i. This is not the classical Rahimi-Recht random Fourier feature construction (which targets shift-invariant kernels via cosines): the exponential dot-product kernel is not shift-invariant, and the trigonometric RFF estimator was empirically shown to be numerically unstable for softmax attention. FAVOR+ uses non-negative exponentials specifically to keep the estimated attention weights non-negative.
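The feature map and a Monte Carlo check of its unbiasedness can be sketched as follows (an illustrative sketch with i.i.d. Gaussian draws; the function name is ours, and FAVOR+ proper additionally orthogonalizes the \omega_i to reduce variance):

```python
import numpy as np

def favor_plus_features(X, omega):
    """Positive random features for the exponential dot-product kernel.

    X: (n, d) queries/keys, with the 1/sqrt(d) scaling assumed already
    absorbed; omega: (m, d), rows drawn from N(0, I_d).
    Returns phi(X): (n, m), elementwise positive.
    """
    m = omega.shape[0]
    sq = 0.5 * np.sum(X**2, axis=1, keepdims=True)   # ||x||^2 / 2
    return np.exp(X @ omega.T - sq) / np.sqrt(m)

# Monte Carlo check: phi(q)^T phi(k) should concentrate on exp(q^T k)
rng = np.random.default_rng(4)
q = 0.3 * rng.standard_normal(4)
k = 0.3 * rng.standard_normal(4)
omega = rng.standard_normal((200_000, 4))
est = (favor_plus_features(q[None], omega)
       @ favor_plus_features(k[None], omega).T).item()
true = float(np.exp(q @ k))
```

Note the dependence on the norms of q and k hiding in the variance: the small 0.3 scale keeps the estimate tight here, while large norms (sharp attention) would need far more features.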

Proposition

Performer Approximation Quality

Statement

The Performer estimator \hat{\kappa}(q, k) = \phi(q)^\top \phi(k) is unbiased:

\mathbb{E}[\hat{\kappa}(q, k)] = \exp(q^\top k / \sqrt{d}),

and for fixed (q, k) with bounded norms its variance scales as O(1/m). Choromanski et al. give pointwise concentration bounds and an error bound on the approximated attention output that depends polynomially on m and on the maximum query/key norm; orthogonal random features (part of FAVOR+) further reduce variance relative to i.i.d. Gaussian draws.

Intuition

A Monte Carlo construction that mirrors the kernel-trick form of attention. Variance falls like 1/m, but the constants depend on how peaked the true softmax weights are: very sharp attention patterns require many more random features than smooth ones.

Proof Sketch

Unbiasedness follows from the identity \exp(q^\top k) = \mathbb{E}_{\omega \sim \mathcal{N}(0, I_d)}\left[\exp(\omega^\top q - \|q\|^2/2) \exp(\omega^\top k - \|k\|^2/2)\right]. Pointwise concentration follows from sub-exponential tail bounds for products of Gaussian-driven exponentials with bounded q, k norms.

Why It Matters

This gives a principled way to trade accuracy for speed. But the constants in the bounds matter: there is no uniform relative-error guarantee at modest m, and on highly peaked attention distributions the required m can erase the asymptotic speedup over exact attention.

Failure Mode

For tasks requiring sharp attention patterns (copying, exact retrieval), the variance can be large enough that quality degrades noticeably. The classical trigonometric random-Fourier-feature construction also gives an unbiased softmax-kernel estimator in principle but produces signed estimates and is numerically unstable for attention; this is why Performer specifically uses non-negative features.

Why FlashAttention Won

FlashAttention computes exact softmax attention but tiles the computation to minimize HBM (high bandwidth memory) reads and writes. The key insight: for sequences up to roughly 16K, the attention bottleneck is not FLOPs but memory IO. The n \times n attention matrix need not be materialized; it can be computed and consumed block by block.
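The block-by-block trick rests on the online softmax rescaling, which can be sketched in NumPy (a numerical sketch only: the real kernel keeps each block in on-chip SRAM, fuses the matmuls, and handles the backward pass via recomputation):

```python
import numpy as np

def tiled_attention(Q, K, V, block=32):
    """Exact attention computed one K/V block at a time.

    Mathematically identical (up to float error) to materializing
    the full n x n matrix; only O(n * block) scores exist at once.
    """
    n, d = Q.shape
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)    # running max score per query
    row_sum = np.zeros(n)            # running softmax denominator
    for s in range(0, n, block):
        Kb, Vb = K[s:s+block], V[s:s+block]
        S = Q @ Kb.T / np.sqrt(d)                    # (n, block)
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)            # rescale old state
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]
```

The rescaling by exp(old_max - new_max) is what lets each block be processed without ever knowing the global softmax normalizer in advance.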

FlashAttention achieves 2-4x wall-clock speedup over standard attention for typical LLM training sequence lengths (2K-8K), while computing the exact same output. This made most approximate attention methods unnecessary for their target use case (moderate-length sequences on modern GPUs).

Sub-quadratic methods still matter when n is truly enormous (100K+ tokens), where the O(n^2) FLOP cost itself dominates. But for most current LLM training and inference, FlashAttention (and its successors) suffice.

Common Confusions

Watch Out

Sub-quadratic FLOPs does not mean faster wall-clock time

A method with O(n \log n) FLOPs can be slower than one with O(n^2) FLOPs if it has poor memory access patterns, cannot use tensor cores, or requires custom kernels. On modern GPUs, memory bandwidth is often the bottleneck, not arithmetic. This is why FlashAttention (which performs extra FLOPs due to recomputation) is faster than standard attention (which materializes the full attention matrix).
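A rough roofline sketch makes this concrete (illustrative numbers, not a benchmark; the accelerator figures below are hypothetical stand-ins for a modern GPU):

```python
def attention_intensity(d, bytes_per_score=2, passes=4):
    """Approximate FLOPs per HBM byte for standard attention.

    FLOPs ~ 4 n^2 d (the QK^T and PV matmuls); HBM traffic
    ~ passes * n^2 * bytes_per_score if the score matrix is written
    out and read back. The n^2 cancels, leaving a constant
    intensity of 4d / (passes * bytes_per_score) FLOPs/byte.
    """
    return 4 * d / (passes * bytes_per_score)

# Hypothetical accelerator: C = 312e12 FLOP/s, B = 2e12 byte/s
ridge = 312e12 / 2e12              # ~156 FLOP/byte to be compute-bound
print(attention_intensity(128), ridge)   # 64.0 vs 156.0 -> memory-bound
```

The intensity is independent of n and well below the ridge point for typical d, which is why avoiding the round trip to HBM (rather than reducing FLOPs) is what actually speeds attention up.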

Watch Out

Linear attention is not a drop-in replacement

Linear attention changes the model's expressive power, not just its cost. The softmax attention matrix can be arbitrarily sparse and peaked. Linear attention with a fixed feature map cannot represent sharp attention patterns. Models trained with linear attention often require architectural changes (gating, decay factors) to match softmax attention quality.

Summary

  • Standard attention costs O(n^2 d) time and O(n^2) memory
  • Linear attention uses the kernel trick to achieve O(nmd) but sacrifices sharp attention patterns
  • Sparse methods (Longformer, BigBird) compute exact attention on a structured subset of pairs
  • Low-rank methods (Linformer) project K, V to a fixed-rank subspace, so the attention matrix is approximated by a low-rank factorization
  • Kernel-approximation methods (Performer) replace softmax with a random-feature kernel, which is not a low-rank structure on the attention matrix itself
  • FlashAttention computes exact attention with IO-aware tiling, making approximate methods unnecessary for n < 16K
  • For truly long sequences (n > 100K), sub-quadratic methods and state-space models remain relevant

Exercises

ExerciseCore

Problem

In linear attention with feature map \phi: \mathbb{R}^d \to \mathbb{R}^m, what are the shapes of the precomputed quantities S = \sum_j \phi(K_j) V_j^\top and z = \sum_j \phi(K_j)? What is the total memory cost of storing S and z compared to storing the full n \times n attention matrix?

ExerciseAdvanced

Problem

Explain why FlashAttention performs more FLOPs than standard attention (due to recomputation in the backward pass) yet achieves faster wall-clock time. Quantify the tradeoff: if HBM bandwidth is B bytes/sec and compute throughput is C FLOP/sec, under what condition on n and d is standard attention memory-bound?

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
  • Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020), arXiv:2006.16236

Architecture variants cited in the body:

  • Beltagy, Peters, Cohan, "Longformer: The Long-Document Transformer" (2020), arXiv:2004.05150
  • Zaheer et al., "Big Bird: Transformers for Longer Sequences" (NeurIPS 2020), arXiv:2007.14062
  • Wang et al., "Linformer: Self-Attention with Linear Complexity" (2020), arXiv:2006.04768
  • Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021), arXiv:2009.14794

Current:

  • Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (NeurIPS 2022), arXiv:2205.14135
  • Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (ICLR 2024), arXiv:2307.08691
  • Shah et al., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision" (2024), arXiv:2407.08608
  • Tay et al., "Efficient Transformers: A Survey" (ACM Computing Surveys 2022), arXiv:2009.06732

Further reading (not covered in body):

  • Kitaev, Kaiser, Levskaya, "Reformer: The Efficient Transformer" (ICLR 2020), arXiv:2001.04451 — LSH-based sparse attention.
  • Liu et al., "Ring Attention with Blockwise Transformers for Near-Infinite Context" (2023), arXiv:2310.01889 — sequence parallelism for multi-million-token contexts.
  • Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023), arXiv:2312.00752 — the SSM-based alternative that largely replaced linear-attention work post-2023.


Last reviewed: April 26, 2026
