
Attention Mechanism Theory

Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why scaling by the square root of key dimension prevents softmax saturation, multi-head attention, and the connection to kernel methods.


Why This Matters

[Visual: Attention Scaling View. With $d_k = 64$, the scaled logits for four example keys (syntax 0.72, entity 0.96, recency 1.08, target 1.18) map to softmax weights of 0.19, 0.24, 0.27, and 0.30. Scaling the dot product keeps logits inside the softmax regime where multiple keys can still compete.]

Interpretation: scaling keeps the logits comparable even as the key dimension grows, so several keys can still share attention mass. Good attention keeps multiple plausible matches alive while gradients are still informative.

Attention is the computational primitive that makes transformers work. Every modern LLM (GPT-4, Claude, Gemini, Llama) processes information through billions of attention operations. Understanding attention mathematically means understanding why the specific formula $\text{softmax}(QK^\top / \sqrt{d_k})V$ takes the form it does, what goes wrong without the scaling factor, and how attention relates to classical ideas in statistics and kernel methods.

This topic focuses on the mathematical theory of attention itself, separated from the full transformer architecture, so you can build precise intuition for this single operation before composing it into larger systems.

Mental Model

Attention is a soft dictionary lookup. You have a query ("what am I looking for?"), a set of keys ("what does each entry contain?"), and a set of values ("what information does each entry carry?"). The query is compared against all keys to produce similarity scores. These scores become weights via softmax, and the output is a weighted sum of values.

Unlike a hard dictionary lookup (which returns the value of the exact matching key), attention returns a blend of all values, weighted by how well each key matches the query. This soft matching is what allows attention to combine information from multiple positions in a differentiable way.
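The contrast between hard and soft lookup can be made concrete in a few lines of NumPy. This is a minimal sketch; the keys, values, and query are illustrative toy data, not from any trained model:

```python
import numpy as np

def soft_lookup(q, K, V):
    """Attention as a soft dictionary: softmax-weighted blend of ALL values."""
    scores = K @ q                      # similarity of the query to every key
    w = np.exp(scores - scores.max())   # subtract max for numerical stability
    w /= w.sum()                        # weights form a probability distribution
    return w @ V, w                     # convex combination of the value rows

def hard_lookup(q, K, V):
    """Classical dictionary: return only the best-matching value."""
    return V[np.argmax(K @ q)]

K = np.eye(3)                                       # three orthogonal keys
V = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])  # one value row per key
q = np.array([2.0, 1.9, 0.0])                       # nearly ties keys 0 and 1

out, w = soft_lookup(q, K, V)   # blends V[0] and V[1] almost equally
```

With the query nearly tying keys 0 and 1, the soft lookup blends their values almost equally (with a small contribution from key 2), while the hard lookup returns only `V[0]`.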

Formal Setup and Notation

Let $n$ be the sequence length and $d$ the model dimension. The input is a matrix $X \in \mathbb{R}^{n \times d}$ where each row is a token embedding.

We project the input into three spaces:

$$Q = XW_Q \in \mathbb{R}^{n \times d_k}, \quad K = XW_K \in \mathbb{R}^{n \times d_k}, \quad V = XW_V \in \mathbb{R}^{n \times d_v}$$

where $W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.

Scaled Dot-Product Attention

Definition

Scaled Dot-Product Attention

The scaled dot-product attention function is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

where the softmax is applied independently to each row of the $n \times n$ matrix $QK^\top / \sqrt{d_k}$.

For a single query $q_i$ (the $i$-th row of $Q$), the output is:

$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij} \, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i^\top k_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(q_i^\top k_\ell / \sqrt{d_k})}$$

The attention weights $\alpha_{ij}$ form a probability distribution over positions: $\alpha_{ij} \geq 0$ and $\sum_j \alpha_{ij} = 1$.
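The definition translates directly into NumPy. This is a sketch for a single unbatched head; real implementations add batching, masking, and fused numerically stable kernels:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stabilized row-wise softmax
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)   # (n, n) matrix of similarity scores
    A = softmax(logits)               # each row is a distribution over positions
    return A @ V, A

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 32
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, A = attention(Q, K, V)   # out: (n, d_v); each row of A sums to 1
```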

Why Scale by $\sqrt{d_k}$

Proposition

Scaling Prevents Softmax Saturation

Statement

Assume $q_1, \ldots, q_{d_k}, k_1, \ldots, k_{d_k}$ are mutually independent random variables, each with mean $0$ and variance $1$. Then the dot product $q^\top k$ has:

$$\mathbb{E}[q^\top k] = 0, \qquad \text{Var}(q^\top k) = d_k$$

The scaled dot product $q^\top k / \sqrt{d_k}$ has variance $1$, regardless of $d_k$.

Intuition

The dot product is a sum of $d_k$ independent terms $q_i k_i$, each with variance $1$. By the independence assumption, the variance of the sum is the sum of the variances: $d_k$. Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax inputs into regions where the gradient is near zero (softmax saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to $1$, keeping the softmax in a regime with useful gradients.

Proof Sketch

Each entry $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\mathbb{E}[k_i] = 0$ and $\text{Var}(q_i k_i) = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] = 1 \cdot 1 = 1$ (using independence and the fact that the means are zero).

The dot product $q^\top k = \sum_{i=1}^{d_k} q_i k_i$ is a sum of $d_k$ independent random variables, so $\text{Var}(q^\top k) = d_k$.

After scaling: $\text{Var}(q^\top k / \sqrt{d_k}) = d_k / d_k = 1$.

Why It Matters

Without scaling, a model with $d_k = 512$ would have dot products with standard deviation $\sqrt{512} \approx 22.6$. Softmax inputs of magnitude 20+ produce outputs extremely close to 0 or 1, with gradients on the order of $10^{-9}$. Training would effectively freeze. The $\sqrt{d_k}$ scaling is not a minor numerical convenience; it is essential for trainability.
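The variance claim is easy to check empirically. The sketch below draws random unit-variance queries and keys (10,000 samples per dimension, an arbitrary sample size) and compares unscaled and scaled dot-product spread:

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    q = rng.standard_normal((10_000, d_k))   # 10k independent query vectors
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)               # 10k dot products q^T k
    print(f"d_k={d_k:5d}  unscaled std={dots.std():6.2f}  "
          f"scaled std={(dots / np.sqrt(d_k)).std():.3f}")
# The unscaled std grows like sqrt(d_k) (roughly 4, 8, 16, 32);
# after dividing by sqrt(d_k) it stays near 1 at every dimension.
```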

Failure Mode

The assumption that the entries of $q$ and $k$ are independent with unit variance holds approximately at initialization (with proper weight initialization) but may not hold after training. In practice, the model learns to calibrate its own attention logits, so the scaling factor becomes less critical later in training. However, removing it entirely still causes training instability.

Attention as Soft Dictionary Lookup

Definition

Attention as Soft Dictionary Lookup

A hard dictionary lookup with query $q$, keys $\{k_1, \ldots, k_n\}$, and values $\{v_1, \ldots, v_n\}$ returns $v_{j^*}$ where $j^* = \arg\max_j \, \text{sim}(q, k_j)$ for some similarity function.

Attention replaces the hard $\arg\max$ with a soft weighting:

$$\text{output} = \sum_{j=1}^{n} \underbrace{\frac{\exp(\text{sim}(q, k_j))}{\sum_\ell \exp(\text{sim}(q, k_\ell))}}_{\text{soft selection weight}} \cdot v_j$$

The output is a convex combination of all values, with higher weight on values whose keys are more similar to the query.

In the transformer, the similarity function is $\text{sim}(q, k) = q^\top k / \sqrt{d_k}$. Scaling by $1/\sqrt{d_k}$ is a fixed normalization, not a learned temperature. When dot-product magnitudes grow (at inference time with unusually large query norms, or when studying asymptotic behavior), the softmax becomes peaky and the soft lookup approaches a hard one.
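The peaking behavior can be seen by sweeping a multiplier over a fixed set of scores (the scores here are arbitrary illustrative numbers, standing in for growing logit magnitudes):

```python
import numpy as np

scores = np.array([2.0, 1.5, 1.0, 0.5])   # fixed similarity scores, four keys
for scale in (1, 4, 16):
    w = np.exp(scale * scores - (scale * scores).max())  # stabilized softmax
    w /= w.sum()
    print(scale, np.round(w, 3))
# scale 1 spreads mass over several keys; by scale 16 the weights are
# nearly one-hot on key 0: the soft lookup has become effectively hard.
```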

Watch Out

Attention is not literally dictionary lookup

The soft-dictionary analogy is a useful mental model for the mechanics of a single attention head. It does not capture what attention computes in a trained transformer. Learned $Q, K, V$ projections make queries and keys live in spaces that have no relation to a human-interpretable notion of "matching." Mechanistic interpretability work (Elhage et al. 2021, Olsson et al. 2022) shows heads implementing copying, induction, positional patterns, and composition. Use the analogy to understand the formula, not to predict what heads do.

Self-Attention vs Cross-Attention

Definition

Self-Attention

In self-attention, queries, keys, and values all come from the same input sequence:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

Each token attends to all tokens in the same sequence (including itself). Self-attention is how a model builds contextual representations.

Definition

Cross-Attention

In cross-attention, queries come from one sequence and keys/values come from another:

$$Q = X_{\text{target}} W_Q, \quad K = X_{\text{source}} W_K, \quad V = X_{\text{source}} W_V$$

This is used in encoder-decoder models (e.g., for translation: the decoder attends to the encoder output) and in retrieval-augmented generation.

The mathematical formulation is identical. The only difference is whether $Q$ and $K, V$ derive from the same or different input matrices.

Watch Out

Decoder self-attention is causally masked

The unmasked formulation above lets every position attend to every other position. Decoder-only transformers (GPT-style) and the decoder of encoder-decoder models apply a causal mask: the logit $q_i^\top k_j / \sqrt{d_k}$ is set to $-\infty$ for $j > i$ before the softmax, so position $i$ only attends to positions $\leq i$. This is what makes autoregressive next-token prediction well-defined and is also what makes the KV cache correct, since past keys and values do not depend on future tokens.
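A sketch of the masked variant, extending the unmasked formula with an upper-triangular mask (single head, NumPy; frameworks fuse this masking into the softmax kernel):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i sees only positions <= i."""
    n, d_k = Q.shape
    logits = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diag
    logits = np.where(mask, -np.inf, logits)          # -inf logit => zero weight
    logits -= logits.max(axis=-1, keepdims=True)      # row max is always finite
    A = np.exp(logits)                                # exp(-inf) = 0 exactly
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
V = rng.standard_normal((4, 3))
out, A = causal_attention(Q, K, V)
# A is lower-triangular: row 0 puts all its mass on position 0.
```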

Multi-Head Attention

Definition

Multi-Head Attention

Instead of a single attention function, compute $h$ heads in parallel on lower-dimensional projections:

$$\text{head}_i = \text{Attention}(XW_Q^{(i)}, XW_K^{(i)}, XW_V^{(i)})$$

where $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d \times d_k}$ and $W_V^{(i)} \in \mathbb{R}^{d \times d_v}$ with $d_k = d_v = d / h$.

Concatenate and project:

$$\text{MHA}(X) = [\text{head}_1; \ldots; \text{head}_h] \, W_O$$

where $W_O \in \mathbb{R}^{d \times d}$.
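The definition translates to a compact NumPy sketch. Note that storing one full-width projection of shape $(d, d)$ and splitting its output into $h$ slices is equivalent to $h$ separate per-head projections of shape $(d, d_k)$; the shapes and the $d^{-1/2}$ initialization below are illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """MHA with d_k = d_v = d/h, following the Vaswani et al. convention."""
    n, d = X.shape
    d_k = d // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # each (n, d)
    # Split the width-d output into h heads of width d_k: shape (h, n, d_k).
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    A = softmax(Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, n, n)
    heads = A @ Vh                                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # [head_1; ...; head_h]
    return concat @ W_O                               # output projection

rng = np.random.default_rng(0)
n, d, h = 6, 32, 4
X = rng.standard_normal((n, d))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)  # shape (n, d)
```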

Proposition

Multi-Head Attention Representational Capacity

Statement

Under the Vaswani et al. (2017) convention $d_k = d_v = d / h$ and ignoring bias terms, multi-head attention has the same weight parameter count as single-head attention with full head dimension $d$:

$$\text{Params}(\text{MHA}) = h \cdot 3 \cdot d \cdot (d/h) + d^2 = 3d^2 + d^2 = 4d^2$$

$$\text{Params}(\text{single-head}) = 3 \cdot d \cdot d + d^2 = 4d^2$$

However, MHA can represent richer attention patterns: each head can specialize in a different type of relationship (syntactic, semantic, positional), and the output projection $W_O$ learns how to combine these patterns.

The equality breaks under other widely used choices. Including biases adds $4d$ parameters. Multi-query attention (MQA) shares $W_K, W_V$ across all heads, giving $d^2 + 2d \cdot (d/h) + d^2$. Grouped-query attention (GQA) interpolates between these extremes. See the MQA/GQA page for the exact counts.

Intuition

Multiple heads give the model multiple independent "attention channels." One head might track subject-verb agreement, another might track coreference, a third might focus on adjacent tokens. Single-head attention is forced to compress all these patterns into a single set of attention weights, which is a lossy compression. Multi-head attention avoids this by giving each pattern its own subspace.

Why It Matters

Empirically, reducing to a single head significantly degrades performance. The multi-head structure is one of the most important design decisions in the transformer. Mechanistic interpretability research has shown that individual heads do specialize: there are "induction heads" that copy patterns, "previous token heads" that attend to the immediately preceding token, and "name mover heads" that track entities. The "one head, one function" picture is a simplification: many heads are polysemantic (a single head implements several behaviors that are entangled across inputs), and head ablation studies (Michel et al. 2019, Voita et al. 2019) find that a sizable fraction of heads can be pruned at inference time with limited loss, indicating that specialization coexists with substantial redundancy.

Computational Complexity

The dominant cost of attention is computing the $n \times n$ attention matrix:

$$A = \text{softmax}(QK^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times n}$$

  • Compute: $QK^\top$ requires $O(n^2 d_k)$ operations. Multiplying $A \cdot V$ requires $O(n^2 d_v)$. Total: $O(n^2 d)$ where $d = h \cdot d_k$.
  • Memory: storing $A$ requires $O(n^2)$ per head, or $O(n^2 h)$ total.

For long sequences ($n = 100{,}000$), the attention matrix has $10^{10}$ entries per head. This quadratic scaling is the fundamental bottleneck for long-context models and motivates FlashAttention (which reduces memory to $O(n)$ by tiling, without changing FLOPs), sparse attention, sub-quadratic architectures, and research into attention sinks and retrieval decay in streaming settings.
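Plugging in concrete numbers for the $n = 100{,}000$ example makes the bottleneck vivid (the head count of 32 and fp16 storage are assumed illustrative values, not from the text):

```python
# Back-of-envelope memory cost of materializing the full attention matrix.
n, h = 100_000, 32                        # sequence length, assumed head count
entries_per_head = n * n                  # 10^10 logits per head
bytes_if_materialized = 2 * entries_per_head * h  # 2 bytes per fp16 entry
print(bytes_if_materialized / 1e9, "GB")  # 640.0 GB -- why tiling matters
```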

Connection to Kernel Methods

Proposition

Attention as a Kernel Smoother

Statement

Scaled dot-product attention can be written as a Nadaraya-Watson kernel regression estimator:

$$\text{output}_i = \sum_{j=1}^{n} \frac{\kappa(q_i, k_j)}{\sum_{\ell=1}^{n} \kappa(q_i, k_\ell)} \, v_j$$

where the kernel function is $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$.

This is a softmax (exponential dot-product) kernel: $\kappa(q, k) = \exp(\langle q, k \rangle / \sqrt{d_k})$. It is not a translation-invariant RBF kernel. Using $q^\top k = (\|q\|^2 + \|k\|^2 - \|q - k\|^2)/2$:

$$\exp\left(\frac{q^\top k}{\sqrt{d_k}}\right) = \exp\left(\frac{\|q\|^2 + \|k\|^2}{2\sqrt{d_k}}\right) \cdot \exp\left(-\frac{\|q - k\|^2}{2\sqrt{d_k}}\right)$$

The left factor breaks translation invariance: unlike the Gaussian RBF kernel, the softmax kernel depends on $\|q\|$ and $\|k\|$ separately, not just on $\|q - k\|$. Only when $\|q\|$ and $\|k\|$ are (approximately) constant across all query-key pairs does the softmax kernel reduce to an RBF kernel. Layer normalization controls the pre-projection norm, but $q = W_Q x$ and $k = W_K x$ are not norm-constrained, so this equivalence is heuristic rather than exact. Tsai et al. (2019) treat the softmax kernel as an asymmetric dot-product kernel, which is the view we use here.

Intuition

The Nadaraya-Watson estimator is a classical nonparametric regression method: to estimate a function value at a query point, take a weighted average of observed values, where the weights are determined by a kernel measuring similarity between the query point and each data point. Attention is doing exactly this: it estimates the output at each position as a kernel-weighted average of value vectors.
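The equivalence is exact and can be verified numerically: a generic Nadaraya-Watson smoother with the softmax kernel reproduces the attention output row by row (the dimensions below are arbitrary toy values):

```python
import numpy as np

def nadaraya_watson(q, keys, values, kernel):
    """Classical kernel smoother: kernel-weighted average of observed values."""
    w = np.array([kernel(q, k) for k in keys])
    return (w / w.sum()) @ values

def softmax_kernel(q, k, d_k=4):
    return np.exp(q @ k / np.sqrt(d_k))

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))   # 3 queries of dimension d_k = 4
K = rng.standard_normal((5, 4))   # 5 keys ("data points")
V = rng.standard_normal((5, 2))   # 5 values ("observed responses")

# Row i of the attention output equals the Nadaraya-Watson estimate at q_i.
logits = Q @ K.T / np.sqrt(4)
A = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
attn_out = A @ V
nw_out = np.vstack([nadaraya_watson(q, K, V, softmax_kernel) for q in Q])
assert np.allclose(attn_out, nw_out)
```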

Why It Matters

This connection has two major implications. First, it explains why attention works as a form of nonparametric in-context learning: the model can adapt its behavior at inference time by "regressing" on the input context, without updating weights. Second, it opens the door to efficient attention approximations via random feature maps for kernels (the "Performers" approach), which approximate the softmax kernel with $O(n)$ complexity.

Failure Mode

The kernel interpretation is cleanest for a single attention head without learned projections. In practice, the learned $W_Q, W_K, W_V$ matrices transform the inputs before the kernel is applied, and multi-head attention applies multiple kernels simultaneously. The kernel analogy is useful for intuition but does not fully capture the representational power of learned multi-head attention.

Common Confusions

Watch Out

Attention weights are NOT learned parameters

The attention weights $\alpha_{ij}$ are computed dynamically from the input at every forward pass. They change for every input sequence. The learned parameters are $W_Q, W_K, W_V, W_O$, which determine how attention is computed, not what the attention pattern is. This input-dependence is the key difference between attention and fixed linear layers.

Watch Out

Additive attention and dot-product attention are different

The original Bahdanau attention (2015) used additive scoring: $\text{score}(q, k) = v^\top \tanh(W_q q + W_k k)$. Dot-product attention uses $\text{score}(q, k) = q^\top k$. Vaswani et al. (2017) showed that dot-product attention is faster (it is a single matrix multiplication) and performs comparably when properly scaled. Additive attention is more flexible but slower. Modern transformers overwhelmingly use scaled dot-product attention, typically with one of the now-standard variants (multi-query, grouped-query, or linear-attention approximations) layered on top.

Watch Out

O(n^2) is in the sequence length, not the model dimension

When people say attention is "quadratic," they mean quadratic in $n$ (sequence length), not $d$ (model dimension). The cost is $O(n^2 d)$. For a fixed model size, doubling the context window quadruples the attention cost. For a fixed context, doubling the model dimension only doubles the cost.

Summary

  • Attention: $\text{softmax}(QK^\top / \sqrt{d_k}) V$, a soft dictionary lookup
  • Scaling by $\sqrt{d_k}$ normalizes dot-product variance to 1, preventing softmax saturation
  • Without scaling, dot-product variance grows as $d_k$, killing gradients
  • Multi-head attention: $h$ parallel heads with $d_k = d/h$, same parameter count as a single full-width head
  • Self-attention: Q, K, V from same input. Cross-attention: Q from target, K/V from source
  • Attention is a kernel smoother (Nadaraya-Watson estimator with softmax kernel)
  • Computational cost: $O(n^2 d)$ compute, $O(n^2)$ memory per head

Exercises

Exercise (Core)

Problem

Suppose $d_k = 64$ and the entries of $q$ and $k$ are i.i.d. with mean 0 and variance 1. What is the standard deviation of the unscaled dot product $q^\top k$? What is the standard deviation after scaling by $\sqrt{d_k}$? If the softmax receives inputs with standard deviation 8, roughly how concentrated will the output distribution be?

Exercise (Advanced)

Problem

Show that attention is permutation-equivariant: if $P$ is a permutation matrix and $X' = PX$, then $\text{Attention}(X'W_Q, X'W_K, X'W_V) = P \cdot \text{Attention}(XW_Q, XW_K, XW_V)$. Why does this mean that a transformer without positional encoding cannot distinguish token order?

Exercise (Research)

Problem

The kernel interpretation says attention uses the kernel $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$. The "Performers" paper (Choromanski et al., 2021) proposes approximating this kernel with random features $\phi(x)$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$, enabling $O(n)$ attention. What is the key mathematical identity that makes this possible, and what is the main accuracy tradeoff?


Frequently Asked Questions

Why is attention quadratic in sequence length?
Computing attention requires the $n \times n$ score matrix, where $n$ is the sequence length. Both memory and FLOPs scale as $O(n^2)$. Long-context LLMs use various techniques (FlashAttention, sparse, sliding-window, ring attention) to reduce or distribute this cost; vanilla attention's quadratic cost is the design constraint that motivated all of them.
What is FlashAttention?
An IO-aware exact attention algorithm (Dao et al. 2022). It tiles the attention computation into blocks that fit in fast on-chip SRAM, computes an online softmax over tiles, and never materializes the full $n \times n$ attention matrix in HBM. Same output as vanilla attention; 2-4x faster wall-clock and far less GPU memory at long context.
What is the difference between MHA, MQA, and GQA?
Multi-Head Attention gives every head independent $W_Q, W_K, W_V$ projections. Multi-Query Attention (Shazeer 2019) shares $W_K, W_V$ across all heads, cutting KV-cache memory at inference at a small quality cost. Grouped-Query Attention (Ainslie et al. 2023) is the middle ground: groups of heads share $W_K, W_V$. LLaMA 2/3 and many modern LLMs use GQA as the default.
How does sliding-window attention work?
Each token only attends to the last $w$ tokens, not the full history. This reduces complexity from $O(n^2)$ to $O(nw)$. Common in efficient long-context models (Longformer, Mistral 7B, sparse-attention variants). It trades exactness for compute on long sequences; later work combines the sliding window with periodic global tokens to recover long-range dependencies.
Is attention really learning kernels?
Yes, in a precise sense: $\text{softmax}(QK^\top / \sqrt{d_k})V$ is exactly a kernel-smoother estimate using the exponential similarity kernel $\kappa(q, k) = \exp(q^\top k / \sqrt{d_k})$. Performers (Choromanski et al. 2020) and linear attention exploit this view to build approximations with $O(n)$ complexity by replacing the softmax kernel with a low-rank feature-map approximation.

References

Canonical:

  • Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762. The transformer paper. See the paper notes page.
  • Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). Original soft-attention alignment.

Current:

  • Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Random-feature kernel approximation.
  • Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel" (EMNLP 2019). Asymmetric kernel view.
  • Olsson et al., "In-context Learning and Induction Heads" (2022). Mechanistic analysis of attention heads.
  • Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). transformer-circuits.pub. Circuit-level mechanistic interpretability of attention.
  • Michel et al., "Are Sixteen Heads Really Better than One?" (NeurIPS 2019), arXiv:1905.10650. Head ablation study: most heads can be removed at test time with minimal loss.
  • Voita et al., "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (ACL 2019), arXiv:1905.09418. Identifies functional head types; pruning with LRP.
  • Dong, Cordonnier, Loukas, "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (ICML 2021), arXiv:2103.03404. Shows pure-attention networks collapse to rank-1 without skip connections and MLPs.
  • Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 ("Transformers and Large Language Models").

Next Topics

The natural next steps from attention theory:

  • KV cache: how autoregressive generation avoids recomputing attention
  • Positional encoding: why attention needs position information and the mathematics of RoPE

Last reviewed: April 26, 2026
