LLM Construction
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why scaling by the square root of key dimension prevents softmax saturation, multi-head attention, and the connection to kernel methods.
Prerequisites
Why This Matters
Attention is the computational primitive that makes transformers work. Every modern LLM (GPT-4, Claude, Gemini, Llama) processes information through billions of attention operations. Understanding attention mathematically means understanding why the specific formula takes the form it does, what goes wrong without the scaling factor, and how attention relates to classical ideas in statistics and kernel methods.
This topic focuses on the mathematical theory of attention itself, separated from the full transformer architecture, so you can build precise intuition for this single operation before composing it into larger systems.
Mental Model
Attention is a soft dictionary lookup. You have a query ("what am I looking for?"), a set of keys ("what does each entry contain?"), and a set of values ("what information does each entry carry?"). The query is compared against all keys to produce similarity scores. These scores become weights via softmax, and the output is a weighted sum of values.
Unlike a hard dictionary lookup (which returns the value of the exact matching key), attention returns a blend of all values, weighted by how well each key matches the query. This soft matching is what allows attention to combine information from multiple positions in a differentiable way.
Formal Setup and Notation
Let $n$ be the sequence length and $d_{\text{model}}$ the model dimension. The input is a matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ where each row is a token embedding.
We project the input into three spaces:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.
Scaled Dot-Product Attention
The scaled dot-product attention function is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where the softmax is applied independently to each row of the $n \times n$ matrix $QK^\top/\sqrt{d_k}$.
For a single query $q_i$ (the $i$-th row of $Q$), the output is:
$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j/\sqrt{d_k})}{\sum_{l=1}^{n} \exp(q_i \cdot k_l/\sqrt{d_k})}$$
The attention weights $\alpha_{ij}$ form a probability distribution over positions: $\alpha_{ij} \geq 0$ and $\sum_{j=1}^{n} \alpha_{ij} = 1$.
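The formula above can be sketched directly in numpy. This is a minimal single-head version for intuition (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q: (n, d_k), K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity logits
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # output (n, d_v), weights (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = scaled_dot_product_attention(Q, K, V)

# Each row of the weight matrix is a probability distribution over positions.
assert np.allclose(A.sum(axis=-1), 1.0)
assert np.all(A >= 0)
```

Subtracting the row maximum before exponentiating does not change the softmax output but prevents overflow, the standard numerically stable formulation.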
Why Scale by $\sqrt{d_k}$
Scaling Prevents Softmax Saturation
Statement
Assume the entries of $q, k \in \mathbb{R}^{d_k}$ are mutually independent random variables, each with mean $0$ and variance $1$. Then the dot product has:
$$\mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$
The scaled dot product $q \cdot k / \sqrt{d_k}$ has variance $1$, regardless of $d_k$.
Intuition
The dot product is a sum of independent terms $q_i k_i$, each with variance $1$. By the independence assumption, the variance of the sum is the sum of the variances: $\mathrm{Var}(q \cdot k) = d_k$. Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax inputs into regions where the gradient is near zero (softmax saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to $1$, keeping the softmax in a regime with useful gradients.
Proof Sketch
Each entry $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$ and $\mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$ (using independence and the fact that the means are zero).
The dot product is a sum of $d_k$ independent random variables, so $\mathrm{Var}(q \cdot k) = d_k$.
After scaling: $\mathrm{Var}\!\left(q \cdot k / \sqrt{d_k}\right) = d_k / (\sqrt{d_k})^2 = 1$.
Why It Matters
Without scaling, a model with $d_k = 512$ would have dot products with standard deviation $\sqrt{512} \approx 22.6$. Softmax inputs of magnitude 20+ produce outputs extremely close to 0 or 1, with gradients on the order of $e^{-20}$. Training would effectively freeze. The scaling is not a minor numerical convenience. It is essential for trainability.
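The variance claim and the saturation effect are both easy to check empirically. A quick simulation (sample sizes and the $d_k = 512$ setting are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)

# Unscaled dot-product variance grows like d_k; scaled variance stays near 1.
var_unscaled = np.var(dots)                  # close to 512
var_scaled = np.var(dots / np.sqrt(d_k))     # close to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The same logits, blown up by sqrt(d_k), saturate the softmax.
logits = rng.normal(size=10)
peak_unscaled = softmax(logits).max()                  # moderate peak
peak_saturated = softmax(logits * np.sqrt(d_k)).max()  # nearly all mass on one entry
```

Because the maximum softmax probability is monotone in the logit scale, inflating the logits always concentrates the distribution; here the concentration is extreme.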
Failure Mode
The assumption that $q$ and $k$ entries are independent with unit variance holds approximately at initialization (with proper weight initialization) but may not hold after training. In practice, the model learns to calibrate its own attention logits, so the scaling factor becomes less critical later in training. However, removing it entirely still causes training instability.
Attention as Soft Dictionary Lookup
A hard dictionary lookup with query $q$, keys $\{k_j\}$, and values $\{v_j\}$ returns $v_{j^*}$ where $j^* = \arg\max_j \text{sim}(q, k_j)$ for some similarity function.
Attention replaces the hard $\arg\max$ with a soft weighting:
$$\text{output} = \sum_j \frac{\exp(\text{sim}(q, k_j))}{\sum_l \exp(\text{sim}(q, k_l))}\, v_j$$
The output is a convex combination of all values, with higher weight on values whose keys are more similar to the query.
In the transformer, the similarity function is $\text{sim}(q, k) = q \cdot k / \sqrt{d_k}$. Scaling by $\sqrt{d_k}$ is a fixed normalization, not a learned temperature. When dot-product magnitudes grow (at inference time with unusually large query norms, or when studying asymptotic behavior) the softmax becomes peaky and the soft lookup approaches a hard one.
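The hard-vs-soft contrast can be made concrete in a few lines of numpy. This sketch uses orthogonal toy keys so the argmax is unambiguous; all names are illustrative:

```python
import numpy as np

def hard_lookup(q, keys, values):
    """Hard dictionary: return the value of the single best-matching key."""
    return values[int(np.argmax(keys @ q))]

def soft_lookup(q, keys, values, d_k):
    """Attention: a convex blend of all values, weighted by key match."""
    w = np.exp(keys @ q / np.sqrt(d_k))
    w /= w.sum()
    return w @ values

rng = np.random.default_rng(0)
keys = np.eye(4)                     # four orthogonal keys, d_k = 4
values = rng.normal(size=(4, 3))

soft = soft_lookup(keys[2], keys, values, d_k=4)    # mild match: a genuine blend
hard = hard_lookup(keys[2], keys, values)

# A strongly matching query makes the soft lookup approach the hard one.
peaked = soft_lookup(40.0 * keys[2], keys, values, d_k=4)
assert np.allclose(peaked, hard, atol=1e-6)
assert not np.allclose(soft, hard, atol=1e-2)
```

The mild query returns a blend of all four values; scaling the query up pushes the softmax toward a one-hot distribution, recovering the hard lookup in the limit.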
Attention is not literally dictionary lookup
The soft-dictionary analogy is a useful mental model for the mechanics of a single attention head. It does not capture what attention computes in a trained transformer. Learned projections make queries and keys live in spaces that have no relation to a human-interpretable notion of "matching." Mechanistic interpretability work (Elhage et al. 2021, Olsson et al. 2022) shows heads implementing copying, induction, positional patterns, and composition. Use the analogy to understand the formula, not to predict what heads do.
Self-Attention vs Cross-Attention
Self-Attention
In self-attention, queries, keys, and values all come from the same input sequence:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
Each token attends to all tokens in the same sequence (including itself). Self-attention is how a model builds contextual representations.
Cross-Attention
In cross-attention, queries come from one sequence and keys/values come from another:
$$Q = X_{\text{target}}W^Q, \quad K = X_{\text{source}}W^K, \quad V = X_{\text{source}}W^V$$
This is used in encoder-decoder models (e.g., for translation: the decoder attends to the encoder output) and in retrieval-augmented generation.
The mathematical formulation is identical. The only difference is whether $Q$ and $K, V$ derive from the same or different input matrices.
Decoder self-attention is causally masked
The unmasked formulation above lets every position attend to every other position. Decoder-only transformers (GPT-style) and the decoder of encoder-decoder models apply a causal mask: the logit $(QK^\top/\sqrt{d_k})_{ij}$ is set to $-\infty$ for $j > i$ before the softmax, so position $i$ only attends to positions $j \leq i$. This is what makes autoregressive next-token prediction well-defined and is also what makes the KV cache correct, since past keys and values do not depend on future tokens.
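Both the mask and the KV-cache claim can be demonstrated in a few lines. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i attends only to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)          # mask future logits
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                  # exp(-inf) = 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out, w = causal_attention(X, X, X)
assert np.allclose(np.triu(w, k=1), 0.0)  # no weight on future positions

# Truncating the sequence leaves earlier outputs unchanged:
# exactly the property that makes the KV cache valid.
out4, _ = causal_attention(X[:4], X[:4], X[:4])
assert np.allclose(out4, out[:4])
```

The final assertion is the KV-cache correctness property from the text: output $i$ depends only on rows $0..i$, so past results never need recomputing as the sequence grows.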
Multi-Head Attention
Instead of a single attention function, compute $h$ heads in parallel on lower-dimensional projections:
$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ with $d_k = d_v = d_{\text{model}}/h$.
Concatenate and project:
$$\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
Multi-Head Attention Representational Capacity
Statement
Under the Vaswani et al. (2017) convention and ignoring bias terms, multi-head attention has the same weight parameter count as single-head attention with full head dimension $d_k = d_{\text{model}}$:
$$\underbrace{3h \cdot d_{\text{model}} \cdot \tfrac{d_{\text{model}}}{h}}_{Q,\,K,\,V \text{ projections}} + \underbrace{d_{\text{model}}^2}_{W^O} = 4\,d_{\text{model}}^2$$
However, MHA can represent richer attention patterns: each head can specialize in a different type of relationship (syntactic, semantic, positional), and the output projection learns how to combine these patterns.
The equality breaks under other widely used choices. Including biases adds parameters. Multi-query attention (MQA) shares a single $K$ and $V$ projection across all heads, giving $\left(2 + \tfrac{2}{h}\right) d_{\text{model}}^2$ weight parameters. Grouped-query attention (GQA) interpolates between these extremes. See the MQA/GQA page for the exact counts.
Intuition
Multiple heads give the model multiple independent "attention channels." One head might track subject-verb agreement, another might track coreference, a third might focus on adjacent tokens. Single-head attention is forced to compress all these patterns into a single set of attention weights, which is a lossy compression. Multi-head attention avoids this by giving each pattern its own subspace.
Why It Matters
Empirically, reducing to a single head significantly degrades performance. The multi-head structure is one of the most important design decisions in the transformer. Mechanistic interpretability research has shown that individual heads do specialize: there are "induction heads" that copy patterns, "previous token heads" that attend to the immediately preceding token, and "name mover heads" that track entities. The "one head, one function" picture is a simplification: many heads are polysemantic (a single head implements several behaviors that are entangled across inputs), and head ablation studies (Michel et al. 2019, Voita et al. 2019) find that a sizable fraction of heads can be pruned at inference time with limited loss, indicating that specialization coexists with substantial redundancy.
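The MHA computation and the parameter-count equality can both be verified in a short numpy sketch, written under the Vaswani et al. convention (full-width $W^Q, W^K, W^V$ matrices sliced into $h$ heads; all names illustrative):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO, h):
    """WQ/WK/WV: (d_model, d_model), column-sliced into h heads; WO: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_head, (i + 1) * d_head)
        Q, K, V = X @ WQ[:, sl], X @ WK[:, sl], X @ WV[:, sl]
        A = softmax_rows(Q @ K.T / np.sqrt(d_head))   # per-head attention pattern
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ WO        # concat then output projection

rng = np.random.default_rng(0)
n, d_model, h = 4, 32, 4
X = rng.normal(size=(n, d_model))
WQ, WK, WV, WO = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, WQ, WK, WV, WO, h)
assert out.shape == (n, d_model)

# Same weight count as single-head attention with full head dimension.
assert sum(W.size for W in (WQ, WK, WV, WO)) == 4 * d_model ** 2
```

Each head computes its own $n \times n$ attention pattern over a $d_{\text{model}}/h$-dimensional subspace; the output projection mixes the concatenated head outputs.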
Computational Complexity
The dominant cost of attention is computing the attention matrix:
- Compute: $QK^\top$ requires $O(n^2 d_k)$ operations. Multiplying $\text{softmax}(\cdot)\,V$ requires $O(n^2 d_v)$. Total: $O(n^2 d)$ where $d = \max(d_k, d_v)$.
- Memory: Storing $QK^\top$ requires $O(n^2)$ per head, or $O(h n^2)$ total.
For long sequences, the attention matrix has $n^2$ entries per head. This quadratic scaling is the fundamental bottleneck for long-context models and motivates FlashAttention (which reduces memory to $O(n)$ by tiling, without changing FLOPs), sparse attention, sub-quadratic architectures, and research into attention sinks and retrieval decay in streaming settings.
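The quadratic memory term is easy to make concrete. A tiny illustrative calculation (fp16 storage assumed, 2 bytes per entry):

```python
# Memory for the n x n attention score matrix, per head, in fp16 (2 bytes/entry).
def attn_matrix_mib(n):
    return 2 * n * n / 2**20

for n in (1_024, 8_192, 65_536):
    print(f"n={n:>6}: {attn_matrix_mib(n):>7.0f} MiB per head")
# n=  1024:       2 MiB per head
# n=  8192:     128 MiB per head
# n= 65536:    8192 MiB per head
```

Multiplying the sequence length by 8 multiplies the per-head matrix by 64, which is why materializing $QK^\top$ becomes untenable long before the model weights do.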
Connection to Kernel Methods
Attention as a Kernel Smoother
Statement
Scaled dot-product attention can be written as a Nadaraya-Watson kernel regression estimator:
$$\text{output}_i = \frac{\sum_j \kappa(q_i, k_j)\, v_j}{\sum_l \kappa(q_i, k_l)}$$
where the kernel function is $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$.
This is a softmax (exponential dot-product) kernel. It is not a translation-invariant RBF kernel. Using $q \cdot k = \tfrac{1}{2}\left(\|q\|^2 + \|k\|^2 - \|q - k\|^2\right)$:
$$\kappa(q, k) = \exp\!\left(\frac{\|q\|^2 + \|k\|^2}{2\sqrt{d_k}}\right) \exp\!\left(-\frac{\|q - k\|^2}{2\sqrt{d_k}}\right)$$
The left factor breaks translation invariance: unlike the Gaussian RBF kernel, the softmax kernel depends on $\|q\|$ and $\|k\|$ separately, not just on $q - k$. Only when $\|q\|$ and $\|k\|$ are (approximately) constant across all query-key pairs does the softmax kernel reduce to an RBF kernel. Layer normalization controls the pre-projection norm, but $q$ and $k$ are not norm-constrained, so this equivalence is heuristic rather than exact. Tsai et al. (2019) treat the softmax kernel as an asymmetric dot-product kernel, which is the view we use here.
Intuition
The Nadaraya-Watson estimator is a classical nonparametric regression method: to estimate a function value at a query point, take a weighted average of observed values, where the weights are determined by a kernel measuring similarity between the query point and each data point. Attention is doing exactly this: it estimates the output at each position as a kernel-weighted average of value vectors.
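The identity between attention and the Nadaraya-Watson estimator is exact and can be checked numerically (toy shapes, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 3
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# Standard attention: row-wise softmax of scaled scores, then weighted sum.
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=-1, keepdims=True))
attn = (A / A.sum(axis=-1, keepdims=True)) @ V

# Nadaraya-Watson estimator with kernel exp(q . k / sqrt(d_k)).
kappa = np.exp(S)                                    # (n, n) kernel matrix
nw = (kappa @ V) / kappa.sum(axis=-1, keepdims=True)

assert np.allclose(attn, nw)
```

The row-max subtraction in the softmax cancels between numerator and denominator, which is why the two formulations agree to machine precision.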
Why It Matters
This connection has two major implications. First, it explains why attention works as a form of nonparametric in-context learning: the model can adapt its behavior at inference time by "regressing" on the input context, without updating weights. Second, it opens the door to efficient attention approximations via random feature maps for kernels (the "Performers" approach), which approximate the softmax kernel with $O(n)$ complexity.
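The random-feature idea behind Performers rests on the Gaussian moment identity $\mathbb{E}_{w \sim \mathcal{N}(0, I)}[\exp(w \cdot x)] = \exp(\|x\|^2/2)$, which yields unbiased positive features for the exponential kernel. A Monte Carlo sketch of that identity (small vectors and the feature count are illustrative, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 100_000
q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))      # random projections w_j ~ N(0, I)

def phi(x):
    # Positive random features with E[phi(q) . phi(k)] = exp(q . k),
    # using E_w[exp(w . (q + k))] = exp(||q + k||^2 / 2).
    return np.exp(W @ x - x @ x / 2) / np.sqrt(m)

exact = np.exp(q @ k)
approx = phi(q) @ phi(k)
assert abs(approx / exact - 1) < 0.05
```

The accuracy tradeoff is visible in the construction: the estimator's variance grows rapidly with $\|q + k\|^2$, so large logits need many more random features for the same relative error.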
Failure Mode
The kernel interpretation is cleanest for a single attention head without learned projections. In practice, the learned matrices transform the inputs before the kernel is applied, and multi-head attention applies multiple kernels simultaneously. The kernel analogy is useful for intuition but does not fully capture the representational power of learned multi-head attention.
Common Confusions
Attention weights are NOT learned parameters
The attention weights $\alpha_{ij}$ are computed dynamically from the input at every forward pass. They change for every input sequence. The learned parameters are $W^Q, W^K, W^V$ (and $W^O$), which determine how attention is computed, not what the attention pattern is. This input-dependence is the key difference between attention and fixed linear layers.
Additive attention and dot-product attention are different
The original Bahdanau attention (2015) used additive scoring: $\text{score}(q, k) = v_a^\top \tanh(W_1 q + W_2 k)$. Dot-product attention uses $\text{score}(q, k) = q \cdot k$. Vaswani et al. (2017) showed that dot-product attention is faster (it is a single matrix multiplication) and performs comparably when properly scaled. Additive attention is more flexible but slower. Modern transformers overwhelmingly use scaled dot-product attention, typically with one of the now-standard variants (multi-query, grouped-query, or linear-attention approximations) layered on top.
O(n^2) is in the sequence length, not the model dimension
When people say attention is "quadratic," they mean quadratic in $n$ (sequence length), not $d$ (model dimension). The cost is $O(n^2 d)$. For a fixed model size, doubling the context window quadruples the attention cost. For a fixed context, doubling the model dimension only doubles the cost.
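The asymmetry is just arithmetic on the leading term, shown here with illustrative sizes:

```python
# Leading-order FLOPs for the score matrix Q K^T: about 2 * n^2 * d multiply-adds.
def attn_flops(n, d):
    return 2 * n * n * d

base = attn_flops(4096, 128)
assert attn_flops(8192, 128) == 4 * base   # double the context: 4x the cost
assert attn_flops(4096, 256) == 2 * base   # double the dimension: 2x the cost
```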
Summary
- Attention: $\text{softmax}(QK^\top/\sqrt{d_k})\,V$, a soft dictionary lookup
- Scaling by $\sqrt{d_k}$ normalizes dot product variance to 1, preventing softmax saturation
- Without scaling, dot product variance grows as $d_k$, killing gradients
- Multi-head attention: $h$ parallel heads with $d_k = d_{\text{model}}/h$, same parameter count as single head
- Self-attention: Q, K, V from same input. Cross-attention: Q from target, K/V from source
- Attention is a kernel smoother (Nadaraya-Watson estimator with softmax kernel)
- Computational cost: $O(n^2 d)$ compute, $O(n^2)$ memory per head
Exercises
Problem
Suppose $d_k = 64$ and the entries of $q$ and $k$ are i.i.d. with mean 0 and variance 1. What is the standard deviation of the unscaled dot product $q \cdot k$? What is the standard deviation after scaling by $\sqrt{d_k}$? If the softmax receives inputs with standard deviation 8, roughly how concentrated will the output distribution be?
Problem
Show that attention is permutation-equivariant: if $P$ is a permutation matrix and $X' = PX$, then $\text{Attention}(X'W^Q, X'W^K, X'W^V) = P\,\text{Attention}(XW^Q, XW^K, XW^V)$. Why does this mean that a transformer without positional encoding cannot distinguish token order?
Problem
The kernel interpretation says attention uses kernel $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$. The "Performers" paper (Choromanski et al., 2021) proposes approximating this kernel with random features $\phi$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$, enabling $O(n)$ attention. What is the key mathematical identity that makes this possible, and what is the main accuracy tradeoff?
Related Comparisons
Frequently Asked Questions
- Why is attention quadratic in sequence length?
- Computing attention requires the matrix $QK^\top$ of dimension $n \times n$ where $n$ is the sequence length. Both memory and FLOPs scale as $O(n^2)$. Long-context LLMs use various techniques (FlashAttention, sparse, sliding-window, ring attention) to reduce or distribute this cost; vanilla attention's quadratic cost is the design constraint that motivated all of them.
- What is FlashAttention?
- An IO-aware exact attention algorithm (Dao et al. 2022). It tiles $Q$, $K$, and $V$ into blocks that fit in fast on-chip SRAM, computes an online softmax over the tiles, and never materializes the full $n \times n$ attention matrix in HBM. Same output as vanilla attention; 2-4x faster wall-clock and far less GPU memory at long context.
- What is the difference between MHA, MQA, and GQA?
- Multi-Head Attention has independent $K$ and $V$ projections for every head. Multi-Query Attention (Shazeer 2019) shares a single $K$ and $V$ across all heads, cutting KV-cache memory at inference at a small quality cost. Grouped-Query Attention (Ainslie et al. 2023) is the middle ground: groups of heads share $K$ and $V$. LLaMA 2/3 and many modern LLMs use GQA as the default.
- How does sliding-window attention work?
- Each token only attends to the last $w$ tokens, not the full history. This reduces complexity from $O(n^2)$ to $O(nw)$. Common in efficient long-context models (Longformer, Mistral 7B, sparse-attention variants). It trades exactness for compute on long sequences; later work combines sliding window with periodic global tokens to recover long-range dependencies.
- Is attention really learning kernels?
- Yes, in a precise sense: the attention output is exactly a kernel-smoother (Nadaraya-Watson) estimate over the value vectors using the exponential similarity kernel $\exp(q \cdot k / \sqrt{d_k})$. Performers (Choromanski et al. 2020) and linear attention exploit this view to build approximations with $O(n)$ complexity by replacing the softmax kernel with a low-rank feature-map approximation.
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762. The transformer paper. See the paper notes page.
- Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). Original soft-attention alignment.
Current:
- Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Random-feature kernel approximation.
- Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel" (EMNLP 2019). Asymmetric kernel view.
- Olsson et al., "In-context Learning and Induction Heads" (2022). Mechanistic analysis of attention heads.
- Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). transformer-circuits.pub. Circuit-level mechanistic interpretability of attention.
- Michel et al., "Are Sixteen Heads Really Better than One?" (NeurIPS 2019), arXiv:1905.10650. Head ablation study: most heads can be removed at test time with minimal loss.
- Voita et al., "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (ACL 2019), arXiv:1905.09418. Identifies functional head types; pruning with LRP.
- Dong, Cordonnier, Loukas, "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (ICML 2021), arXiv:2103.03404. Shows pure-attention networks collapse to rank-1 without skip connections and MLPs.
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 ("Transformers and Large Language Models").
Next Topics
The natural next steps from attention theory:
- KV cache: how autoregressive generation avoids recomputing attention
- Positional encoding: why attention needs position information and the mathematics of RoPE
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Matrix Operations and Properties (layer 0 · tier 1)
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
- Softmax and Numerical Stability (layer 1 · tier 1)
- Linear Layer: Shapes, Bias, and Memory (layer 2 · tier 1)
- Word Embeddings (layer 2 · tier 2)
Derived topics
- Attention Sinks and Retrieval Decay (layer 4 · tier 2)
- Attention Variants and Efficiency (layer 4 · tier 2)
- Forgetting Transformer (FoX) (layer 4 · tier 2)
- Induction Heads (layer 4 · tier 2)
- Mamba and State-Space Models (layer 4 · tier 2)
+10 more on the derived-topics page.
Graph-backed continuations