LLM Construction
Attention Mechanism Theory
Mathematical formulation of attention: scaled dot-product attention as soft dictionary lookup, why scaling by the square root of key dimension prevents softmax saturation, multi-head attention, and the connection to kernel methods.
Prerequisites
Why This Matters
Attention is the computational primitive that makes transformers work. Every modern LLM (GPT-4, Claude, Gemini, Llama) processes information through billions of attention operations. Understanding attention mathematically means understanding why the specific formula takes the form it does, what goes wrong without the scaling factor, and how attention relates to classical ideas in statistics and kernel methods.
This topic focuses on the mathematical theory of attention itself, separated from the full transformer architecture, so you can build precise intuition for this single operation before composing it into larger systems.
Mental Model
Attention is a soft dictionary lookup. You have a query ("what am I looking for?"), a set of keys ("what does each entry contain?"), and a set of values ("what information does each entry carry?"). The query is compared against all keys to produce similarity scores. These scores become weights via softmax, and the output is a weighted sum of values.
Unlike a hard dictionary lookup (which returns the value of the exact matching key), attention returns a blend of all values, weighted by how well each key matches the query. This soft matching is what allows attention to combine information from multiple positions in a differentiable way.
Formal Setup and Notation
Let $n$ be the sequence length and $d_{\text{model}}$ the model dimension. The input is a matrix $X \in \mathbb{R}^{n \times d_{\text{model}}}$ where each row is a token embedding.
We project the input into three spaces:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.
Scaled Dot-Product Attention
The scaled dot-product attention function is:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where the softmax is applied independently to each row of the $n \times n$ matrix $QK^\top/\sqrt{d_k}$.
For a single query $q_i$ (the $i$-th row of $Q$), the output is:
$$\text{output}_i = \sum_{j=1}^{n} \alpha_{ij}\, v_j, \qquad \alpha_{ij} = \frac{\exp(q_i \cdot k_j/\sqrt{d_k})}{\sum_{l=1}^{n} \exp(q_i \cdot k_l/\sqrt{d_k})}$$
The attention weights $\alpha_{ij}$ form a probability distribution over positions: $\alpha_{ij} \geq 0$ and $\sum_{j=1}^{n} \alpha_{ij} = 1$.
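The formula above can be sketched directly in numpy. This is a minimal single-head version for intuition (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q: (n, d_k), K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) similarity logits
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights                     # output (n, d_v), weights (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 16, 8
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, A = scaled_dot_product_attention(Q, K, V)

# Each row of the weight matrix is a probability distribution over positions.
assert np.allclose(A.sum(axis=-1), 1.0)
assert np.all(A >= 0)
```

Subtracting the row maximum before exponentiating does not change the softmax output but prevents overflow, the standard numerically stable formulation.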
Why Scale by $\sqrt{d_k}$
Scaling Prevents Softmax Saturation
Statement
Assume the entries of $q, k \in \mathbb{R}^{d_k}$ are mutually independent random variables, each with mean $0$ and variance $1$. Then the dot product has:
$$\mathbb{E}[q \cdot k] = 0, \qquad \mathrm{Var}(q \cdot k) = d_k$$
The scaled dot product $q \cdot k / \sqrt{d_k}$ has variance $1$, regardless of $d_k$.
Intuition
The dot product is a sum of independent terms $q_i k_i$, each with variance $1$. By the independence assumption, the variance of the sum is the sum of the variances: $\mathrm{Var}(q \cdot k) = d_k$. Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax inputs into regions where the gradient is near zero (softmax saturation). Dividing by $\sqrt{d_k}$ normalizes the variance to $1$, keeping the softmax in a regime with useful gradients.
Proof Sketch
Each entry $q_i k_i$ has $\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0$ and $\mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$ (using independence and the fact that the means are zero).
The dot product is a sum of $d_k$ independent random variables, so $\mathrm{Var}(q \cdot k) = d_k$.
After scaling: $\mathrm{Var}\!\left(q \cdot k / \sqrt{d_k}\right) = d_k / (\sqrt{d_k})^2 = 1$.
Why It Matters
Without scaling, a model with $d_k = 512$ would have dot products with standard deviation $\sqrt{512} \approx 22.6$. Softmax inputs of magnitude 20+ produce outputs extremely close to 0 or 1, with gradients on the order of $e^{-20}$. Training would effectively freeze. The scaling is not a minor numerical convenience. It is essential for trainability.
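The variance claim and the saturation effect are both easy to check empirically. A quick simulation (sample sizes and the $d_k = 512$ setting are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)

# Unscaled dot-product variance grows like d_k; scaled variance stays near 1.
var_unscaled = np.var(dots)                  # close to 512
var_scaled = np.var(dots / np.sqrt(d_k))     # close to 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# The same logits, blown up by sqrt(d_k), saturate the softmax.
logits = rng.normal(size=10)
peak_unscaled = softmax(logits).max()                  # moderate peak
peak_saturated = softmax(logits * np.sqrt(d_k)).max()  # nearly all mass on one entry
```

Because the maximum softmax probability is monotone in the logit scale, inflating the logits always concentrates the distribution; here the concentration is extreme.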
Failure Mode
The assumption that $q$ and $k$ entries are independent with unit variance holds approximately at initialization (with proper weight initialization) but may not hold after training. In practice, the model learns to calibrate its own attention logits, so the scaling factor becomes less critical later in training. However, removing it entirely still causes training instability.
Attention as Soft Dictionary Lookup
A hard dictionary lookup with query $q$, keys $\{k_j\}$, and values $\{v_j\}$ returns $v_{j^*}$ where $j^* = \arg\max_j \text{sim}(q, k_j)$ for some similarity function.
Attention replaces the hard $\arg\max$ with a soft weighting:
$$\text{output} = \sum_j \frac{\exp(\text{sim}(q, k_j))}{\sum_l \exp(\text{sim}(q, k_l))}\, v_j$$
The output is a convex combination of all values, with higher weight on values whose keys are more similar to the query.
In the transformer, the similarity function is $\text{sim}(q, k) = q \cdot k / \sqrt{d_k}$. Scaling by $\sqrt{d_k}$ is a fixed normalization, not a learned temperature. When dot-product magnitudes grow (at inference time with unusually large query norms, or when studying asymptotic behavior) the softmax becomes peaky and the soft lookup approaches a hard one.
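The hard-vs-soft contrast can be made concrete in a few lines of numpy. This sketch uses orthogonal toy keys so the argmax is unambiguous; all names are illustrative:

```python
import numpy as np

def hard_lookup(q, keys, values):
    """Hard dictionary: return the value of the single best-matching key."""
    return values[int(np.argmax(keys @ q))]

def soft_lookup(q, keys, values, d_k):
    """Attention: a convex blend of all values, weighted by key match."""
    w = np.exp(keys @ q / np.sqrt(d_k))
    w /= w.sum()
    return w @ values

rng = np.random.default_rng(0)
keys = np.eye(4)                     # four orthogonal keys, d_k = 4
values = rng.normal(size=(4, 3))

soft = soft_lookup(keys[2], keys, values, d_k=4)    # mild match: a genuine blend
hard = hard_lookup(keys[2], keys, values)

# A strongly matching query makes the soft lookup approach the hard one.
peaked = soft_lookup(40.0 * keys[2], keys, values, d_k=4)
assert np.allclose(peaked, hard, atol=1e-6)
assert not np.allclose(soft, hard, atol=1e-2)
```

The mild query returns a blend of all four values; scaling the query up pushes the softmax toward a one-hot distribution, recovering the hard lookup in the limit.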
Attention is not literally dictionary lookup
The soft-dictionary analogy is a useful mental model for the mechanics of a single attention head. It does not capture what attention computes in a trained transformer. Learned projections make queries and keys live in spaces that have no relation to a human-interpretable notion of "matching." Mechanistic interpretability work (Elhage et al. 2021, Olsson et al. 2022) shows heads implementing copying, induction, positional patterns, and composition. Use the analogy to understand the formula, not to predict what heads do.
Self-Attention vs Cross-Attention
Self-Attention
In self-attention, queries, keys, and values all come from the same input sequence:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
Each token attends to all tokens in the same sequence (including itself). Self-attention is how a model builds contextual representations.
Cross-Attention
In cross-attention, queries come from one sequence and keys/values come from another:
$$Q = X_{\text{target}}W^Q, \quad K = X_{\text{source}}W^K, \quad V = X_{\text{source}}W^V$$
This is used in encoder-decoder models (e.g., for translation: the decoder attends to the encoder output) and in retrieval-augmented generation.
The mathematical formulation is identical. The only difference is whether $Q$ and $K, V$ derive from the same or different input matrices.
Decoder self-attention is causally masked
The unmasked formulation above lets every position attend to every other position. Decoder-only transformers (GPT-style) and the decoder of encoder-decoder models apply a causal mask: the logit $(QK^\top/\sqrt{d_k})_{ij}$ is set to $-\infty$ for $j > i$ before the softmax, so position $i$ only attends to positions $j \leq i$. This is what makes autoregressive next-token prediction well-defined and is also what makes the KV cache correct, since past keys and values do not depend on future tokens.
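Both the mask and the KV-cache claim can be demonstrated in a few lines. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: position i attends only to positions j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    scores = np.where(future, -np.inf, scores)          # mask future logits
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                  # exp(-inf) = 0
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))
out, w = causal_attention(X, X, X)
assert np.allclose(np.triu(w, k=1), 0.0)  # no weight on future positions

# Truncating the sequence leaves earlier outputs unchanged:
# exactly the property that makes the KV cache valid.
out4, _ = causal_attention(X[:4], X[:4], X[:4])
assert np.allclose(out4, out[:4])
```

The final assertion is the KV-cache correctness property from the text: output $i$ depends only on rows $0..i$, so past results never need recomputing as the sequence grows.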
Multi-Head Attention
Instead of a single attention function, compute $h$ heads in parallel on lower-dimensional projections:
$$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
where $W_i^Q, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ with $d_k = d_v = d_{\text{model}}/h$.
Concatenate and project:
$$\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$
where $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.
Multi-Head Attention Representational Capacity
Statement
Under the Vaswani et al. (2017) convention and ignoring bias terms, multi-head attention has the same weight parameter count as single-head attention with full head dimension $d_k = d_{\text{model}}$:
$$\underbrace{3h \cdot d_{\text{model}} \cdot \tfrac{d_{\text{model}}}{h}}_{Q,\,K,\,V \text{ projections}} + \underbrace{d_{\text{model}}^2}_{W^O} = 4\,d_{\text{model}}^2$$
However, MHA can represent richer attention patterns: each head can specialize in a different type of relationship (syntactic, semantic, positional), and the output projection learns how to combine these patterns.
The equality breaks under other widely used choices. Including biases adds parameters. Multi-query attention (MQA) shares a single $K$ and $V$ projection across all heads, giving $\left(2 + \tfrac{2}{h}\right) d_{\text{model}}^2$ weight parameters. Grouped-query attention (GQA) interpolates between these extremes. See the MQA/GQA page for the exact counts.
Intuition
Multiple heads give the model multiple independent "attention channels." One head might track subject-verb agreement, another might track coreference, a third might focus on adjacent tokens. Single-head attention is forced to compress all these patterns into a single set of attention weights, which is a lossy compression. Multi-head attention avoids this by giving each pattern its own subspace.
Why It Matters
Empirically, reducing to a single head significantly degrades performance. The multi-head structure is one of the most important design decisions in the transformer. Mechanistic interpretability research has shown that individual heads do specialize: there are "induction heads" that copy patterns, "previous token heads" that attend to the immediately preceding token, and "name mover heads" that track entities. The "one head, one function" picture is a simplification: many heads are polysemantic (a single head implements several behaviors that are entangled across inputs), and head ablation studies (Michel et al. 2019, Voita et al. 2019) find that a sizable fraction of heads can be pruned at inference time with limited loss, indicating that specialization coexists with substantial redundancy.
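The MHA computation and the parameter-count equality can both be verified in a short numpy sketch, written under the Vaswani et al. convention (full-width $W^Q, W^K, W^V$ matrices sliced into $h$ heads; all names illustrative):

```python
import numpy as np

def softmax_rows(S):
    S = S - S.max(axis=-1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, WQ, WK, WV, WO, h):
    """WQ/WK/WV: (d_model, d_model), column-sliced into h heads; WO: (d_model, d_model)."""
    n, d_model = X.shape
    d_head = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_head, (i + 1) * d_head)
        Q, K, V = X @ WQ[:, sl], X @ WK[:, sl], X @ WV[:, sl]
        A = softmax_rows(Q @ K.T / np.sqrt(d_head))   # per-head attention pattern
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ WO        # concat then output projection

rng = np.random.default_rng(0)
n, d_model, h = 4, 32, 4
X = rng.normal(size=(n, d_model))
WQ, WK, WV, WO = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, WQ, WK, WV, WO, h)
assert out.shape == (n, d_model)

# Same weight count as single-head attention with full head dimension.
assert sum(W.size for W in (WQ, WK, WV, WO)) == 4 * d_model ** 2
```

Each head computes its own $n \times n$ attention pattern over a $d_{\text{model}}/h$-dimensional subspace; the output projection mixes the concatenated head outputs.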
Computational Complexity
The dominant cost of attention is computing the attention matrix:
- Compute: $QK^\top$ requires $O(n^2 d_k)$ operations. Multiplying $\text{softmax}(\cdot)\,V$ requires $O(n^2 d_v)$. Total: $O(n^2 d)$ where $d = \max(d_k, d_v)$.
- Memory: Storing $QK^\top$ requires $O(n^2)$ per head, or $O(h n^2)$ total.
For long sequences, the attention matrix has $n^2$ entries per head. This quadratic scaling is the fundamental bottleneck for long-context models and motivates FlashAttention (which reduces memory to $O(n)$ by tiling, without changing FLOPs), sparse attention, sub-quadratic architectures, and research into attention sinks and retrieval decay in streaming settings.
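The quadratic memory term is easy to make concrete. A tiny illustrative calculation (fp16 storage assumed, 2 bytes per entry):

```python
# Memory for the n x n attention score matrix, per head, in fp16 (2 bytes/entry).
def attn_matrix_mib(n):
    return 2 * n * n / 2**20

for n in (1_024, 8_192, 65_536):
    print(f"n={n:>6}: {attn_matrix_mib(n):>7.0f} MiB per head")
# n=  1024:       2 MiB per head
# n=  8192:     128 MiB per head
# n= 65536:    8192 MiB per head
```

Multiplying the sequence length by 8 multiplies the per-head matrix by 64, which is why materializing $QK^\top$ becomes untenable long before the model weights do.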
Connection to Kernel Methods
Attention as a Kernel Smoother
Statement
Scaled dot-product attention can be written as a Nadaraya-Watson kernel regression estimator:
$$\text{output}_i = \frac{\sum_j \kappa(q_i, k_j)\, v_j}{\sum_l \kappa(q_i, k_l)}$$
where the kernel function is $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$.
This is a softmax (exponential dot-product) kernel. It is not a translation-invariant RBF kernel. Using $q \cdot k = \tfrac{1}{2}\left(\|q\|^2 + \|k\|^2 - \|q - k\|^2\right)$:
$$\kappa(q, k) = \exp\!\left(\frac{\|q\|^2 + \|k\|^2}{2\sqrt{d_k}}\right) \exp\!\left(-\frac{\|q - k\|^2}{2\sqrt{d_k}}\right)$$
The left factor breaks translation invariance: unlike the Gaussian RBF kernel, the softmax kernel depends on $\|q\|$ and $\|k\|$ separately, not just on $q - k$. Only when $\|q\|$ and $\|k\|$ are (approximately) constant across all query-key pairs does the softmax kernel reduce to an RBF kernel. Layer normalization controls the pre-projection norm, but $q$ and $k$ are not norm-constrained, so this equivalence is heuristic rather than exact. Tsai et al. (2019) treat the softmax kernel as an asymmetric dot-product kernel, which is the view we use here.
Intuition
The Nadaraya-Watson estimator is a classical nonparametric regression method: to estimate a function value at a query point, take a weighted average of observed values, where the weights are determined by a kernel measuring similarity between the query point and each data point. Attention is doing exactly this: it estimates the output at each position as a kernel-weighted average of value vectors.
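The identity between attention and the Nadaraya-Watson estimator is exact and can be checked numerically (toy shapes, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 3
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

# Standard attention: row-wise softmax of scaled scores, then weighted sum.
S = Q @ K.T / np.sqrt(d_k)
A = np.exp(S - S.max(axis=-1, keepdims=True))
attn = (A / A.sum(axis=-1, keepdims=True)) @ V

# Nadaraya-Watson estimator with kernel exp(q . k / sqrt(d_k)).
kappa = np.exp(S)                                    # (n, n) kernel matrix
nw = (kappa @ V) / kappa.sum(axis=-1, keepdims=True)

assert np.allclose(attn, nw)
```

The row-max subtraction in the softmax cancels between numerator and denominator, which is why the two formulations agree to machine precision.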
Why It Matters
This connection has two major implications. First, it explains why attention works as a form of nonparametric in-context learning: the model can adapt its behavior at inference time by "regressing" on the input context, without updating weights. Second, it opens the door to efficient attention approximations via random feature maps for kernels (the "Performers" approach), which approximate the softmax kernel with $O(n)$ complexity.
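The random-feature idea behind Performers rests on the Gaussian moment identity $\mathbb{E}_{w \sim \mathcal{N}(0, I)}[\exp(w \cdot x)] = \exp(\|x\|^2/2)$, which yields unbiased positive features for the exponential kernel. A Monte Carlo sketch of that identity (small vectors and the feature count are illustrative, not the paper's exact construction):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 100_000
q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)
W = rng.normal(size=(m, d))      # random projections w_j ~ N(0, I)

def phi(x):
    # Positive random features with E[phi(q) . phi(k)] = exp(q . k),
    # using E_w[exp(w . (q + k))] = exp(||q + k||^2 / 2).
    return np.exp(W @ x - x @ x / 2) / np.sqrt(m)

exact = np.exp(q @ k)
approx = phi(q) @ phi(k)
assert abs(approx / exact - 1) < 0.05
```

The accuracy tradeoff is visible in the construction: the estimator's variance grows rapidly with $\|q + k\|^2$, so large logits need many more random features for the same relative error.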
Failure Mode
The kernel interpretation is cleanest for a single attention head without learned projections. In practice, the learned matrices transform the inputs before the kernel is applied, and multi-head attention applies multiple kernels simultaneously. The kernel analogy is useful for intuition but does not fully capture the representational power of learned multi-head attention.
Common Confusions
Attention weights are NOT learned parameters
The attention weights $\alpha_{ij}$ are computed dynamically from the input at every forward pass. They change for every input sequence. The learned parameters are $W^Q, W^K, W^V$ (and $W^O$), which determine how attention is computed, not what the attention pattern is. This input-dependence is the key difference between attention and fixed linear layers.
Additive attention and dot-product attention are different
The original Bahdanau attention (2015) used additive scoring: $\text{score}(q, k) = v_a^\top \tanh(W_1 q + W_2 k)$. Dot-product attention uses $\text{score}(q, k) = q \cdot k$. Vaswani et al. (2017) showed that dot-product attention is faster (it is a single matrix multiplication) and performs comparably when properly scaled. Additive attention is more flexible but slower. Modern transformers overwhelmingly use scaled dot-product attention, typically with one of the now-standard variants (multi-query, grouped-query, or linear-attention approximations) layered on top.
O(n^2) is in the sequence length, not the model dimension
When people say attention is "quadratic," they mean quadratic in $n$ (sequence length), not $d$ (model dimension). The cost is $O(n^2 d)$. For a fixed model size, doubling the context window quadruples the attention cost. For a fixed context, doubling the model dimension only doubles the cost.
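The asymmetry is just arithmetic on the leading term, shown here with illustrative sizes:

```python
# Leading-order FLOPs for the score matrix Q K^T: about 2 * n^2 * d multiply-adds.
def attn_flops(n, d):
    return 2 * n * n * d

base = attn_flops(4096, 128)
assert attn_flops(8192, 128) == 4 * base   # double the context: 4x the cost
assert attn_flops(4096, 256) == 2 * base   # double the dimension: 2x the cost
```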
Summary
- Attention: $\text{softmax}(QK^\top/\sqrt{d_k})\,V$, a soft dictionary lookup
- Scaling by $\sqrt{d_k}$ normalizes dot product variance to 1, preventing softmax saturation
- Without scaling, dot product variance grows as $d_k$, killing gradients
- Multi-head attention: $h$ parallel heads with $d_k = d_{\text{model}}/h$, same parameter count as single head
- Self-attention: Q, K, V from same input. Cross-attention: Q from target, K/V from source
- Attention is a kernel smoother (Nadaraya-Watson estimator with softmax kernel)
- Computational cost: $O(n^2 d)$ compute, $O(n^2)$ memory per head
Exercises
Problem
Suppose $d_k = 64$ and the entries of $q$ and $k$ are i.i.d. with mean 0 and variance 1. What is the standard deviation of the unscaled dot product $q \cdot k$? What is the standard deviation after scaling by $\sqrt{d_k}$? If the softmax receives inputs with standard deviation 8, roughly how concentrated will the output distribution be?
Problem
Show that attention is permutation-equivariant: if $P$ is a permutation matrix and $X' = PX$, then $\text{Attention}(X'W^Q, X'W^K, X'W^V) = P\,\text{Attention}(XW^Q, XW^K, XW^V)$. Why does this mean that a transformer without positional encoding cannot distinguish token order?
Problem
The kernel interpretation says attention uses kernel $\kappa(q, k) = \exp(q \cdot k / \sqrt{d_k})$. The "Performers" paper (Choromanski et al., 2021) proposes approximating this kernel with random features $\phi$ such that $\kappa(q, k) \approx \phi(q)^\top \phi(k)$, enabling $O(n)$ attention. What is the key mathematical identity that makes this possible, and what is the main accuracy tradeoff?
Related Comparisons
Frequently Asked Questions
- Why is attention quadratic in sequence length?
- Computing attention requires the matrix $QK^\top$ of dimension $n \times n$ where $n$ is the sequence length. Both memory and FLOPs scale as $O(n^2)$. Long-context LLMs use various techniques (FlashAttention, sparse, sliding-window, ring attention) to reduce or distribute this cost; vanilla attention's quadratic cost is the design constraint that motivated all of them.
- What is FlashAttention?
- An IO-aware exact attention algorithm (Dao et al. 2022). It tiles $Q$, $K$, and $V$ into blocks that fit in fast on-chip SRAM, computes an online softmax over the tiles, and never materializes the full $n \times n$ attention matrix in HBM. Same output as vanilla attention; 2-4x faster wall-clock and far less GPU memory at long context.
- What is the difference between MHA, MQA, and GQA?
- Multi-Head Attention has independent $K$ and $V$ projections for every head. Multi-Query Attention (Shazeer 2019) shares a single $K$ and $V$ across all heads, cutting KV-cache memory at inference at a small quality cost. Grouped-Query Attention (Ainslie et al. 2023) is the middle ground: groups of heads share $K$ and $V$. LLaMA 2/3 and many modern LLMs use GQA as the default.
- How does sliding-window attention work?
- Each token only attends to the last $w$ tokens, not the full history. This reduces complexity from $O(n^2)$ to $O(nw)$. Common in efficient long-context models (Longformer, Mistral 7B, sparse-attention variants). It trades exactness for compute on long sequences; later work combines sliding window with periodic global tokens to recover long-range dependencies.
- Is attention really learning kernels?
- Yes, in a precise sense: the attention output is exactly a kernel-smoother (Nadaraya-Watson) estimate over the value vectors using the exponential similarity kernel $\exp(q \cdot k / \sqrt{d_k})$. Performers (Choromanski et al. 2020) and linear attention exploit this view to build approximations with $O(n)$ complexity by replacing the softmax kernel with a low-rank feature-map approximation.
References
Canonical:
- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017), arXiv:1706.03762. The transformer paper. See the paper notes page.
- Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR 2015). Original soft-attention alignment.
Current:
- Choromanski et al., "Rethinking Attention with Performers" (ICLR 2021). Random-feature kernel approximation.
- Tsai et al., "Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel" (EMNLP 2019). Asymmetric kernel view.
- Olsson et al., "In-context Learning and Induction Heads" (2022). Mechanistic analysis of attention heads.
- Elhage et al., "A Mathematical Framework for Transformer Circuits" (Anthropic, 2021). transformer-circuits.pub. Circuit-level mechanistic interpretability of attention.
- Michel et al., "Are Sixteen Heads Really Better than One?" (NeurIPS 2019), arXiv:1905.10650. Head ablation study: most heads can be removed at test time with minimal loss.
- Voita et al., "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" (ACL 2019), arXiv:1905.09418. Identifies functional head types; pruning with LRP.
- Dong, Cordonnier, Loukas, "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth" (ICML 2021), arXiv:2103.03404. Shows pure-attention networks collapse to rank-1 without skip connections and MLPs.
- Jurafsky & Martin, Speech and Language Processing (3rd ed., draft), Chapter 9 ("Transformers and Large Language Models").
Next Topics
The natural next steps from attention theory:
- KV cache: how autoregressive generation avoids recomputing attention
- Positional encoding: why attention needs position information and the mathematics of RoPE
Last reviewed: April 26, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Matrix Operations and Properties (layer 0 · tier 1)
- Gram Matrices and Kernel Matrices (layer 1 · tier 1)
- Softmax and Numerical Stability (layer 1 · tier 1)
- Linear Layer: Shapes, Bias, and Memory (layer 2 · tier 1)
- Word Embeddings (layer 2 · tier 2)
Derived topics
- Attention Sinks and Retrieval Decay (layer 4 · tier 2)
- Attention Variants and Efficiency (layer 4 · tier 2)
- Forgetting Transformer (FoX) (layer 4 · tier 2)
- Induction Heads (layer 4 · tier 2)
- Mamba and State-Space Models (layer 4 · tier 2)
+10 more on the derived-topics page.
Graph-backed continuations