
Transformer vs. Mamba vs. TTT

Three competing sequence architectures: attention (exact retrieval, quadratic cost), state-space models (linear cost, compressed state), and test-time training (gradient-based state updates, rich memory). Each makes different tradeoffs between memory, compute, and retrieval ability.

What Each Does

Transformers process sequences by letting every token attend to every other token. The KV cache stores all past tokens explicitly. Cost per token: O(n·d), where n is the sequence length and d the model dimension. Memory grows linearly with context.
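The per-token decoding step can be sketched in a few lines. This is an illustrative single-head, unbatched version (the name `attend_step` is ours, not from any library): each step does one dot product against every cached key, which is where the O(n·d) cost comes from.

```python
import numpy as np

def attend_step(q, K_cache, V_cache):
    """One decoding step of single-head attention with a KV cache.

    q: (d,) query for the current token.
    K_cache, V_cache: (n, d) keys and values of all past tokens.
    Cost is O(n*d): one dot product per cached token.
    """
    scores = K_cache @ q / np.sqrt(q.shape[0])   # (n,) similarity to every past token
    weights = np.exp(scores - scores.max())       # stable softmax
    weights /= weights.sum()
    return weights @ V_cache                      # (d,) weighted mix of past values
```

Because the cache holds every past token verbatim, retrieval is exact: a sharp softmax can put essentially all weight on one specific past position.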

Mamba (a selective state-space model, or SSM) processes sequences through a learned linear recurrence. The state is a fixed-size matrix h ∈ ℝ^{d×N} (typically N = 16). Cost per token: O(d·N). Memory is constant regardless of context length.
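The recurrence can be sketched as a state update plus a readout. This is a deliberately simplified version (elementwise decay, no input-dependent discretization, so it omits the "selective" part of real Mamba); the point is the cost structure, not the exact parameterization:

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One step of a (simplified) SSM-style linear recurrence.

    h: (d, N) fixed-size state; x: (d,) current input.
    A: (d, N) decay factors; B, C: (N,) input/output projections.
    Cost is O(d*N) per token, independent of sequence length.
    """
    h = A * h + np.outer(x, B)   # decay old state, write the new token in
    y = h @ C                    # (d,) read out from the compressed state
    return h, y
```

Every past token is squashed into the same d×N matrix, which is why retrieval is approximate: information about specific early tokens decays as more tokens are written in.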

TTT (Test-Time Training) processes sequences by updating a weight matrix W ∈ ℝ^{d_inner × d_inner} via gradient descent on a self-supervised loss at each token. Cost per token: O(d_inner²). Memory is constant (the weight matrix itself).
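A minimal sketch of one such update, assuming the simplest setup: a linear inner model and a squared reconstruction loss. (The actual TTT layer of Sun et al. uses learned corruption/target views and more machinery; `ttt_step` and its inputs here are illustrative.)

```python
import numpy as np

def ttt_step(W, x_corrupt, x_target, lr=0.1):
    """One TTT update: a gradient step on the self-supervised loss
    ||W @ x_corrupt - x_target||^2, then a forward pass with the new W.

    W: (d, d) inner weight matrix (the hidden state).
    Cost is O(d^2) per token; no labels are needed.
    """
    err = W @ x_corrupt - x_target            # (d,) reconstruction error
    W = W - lr * np.outer(err, x_corrupt)     # gradient of the squared loss
    return W, W @ x_corrupt                   # updated state, output token
```

The hidden state is literally a small model being trained on the input stream, which is what "rich memory" means in the summary above.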

Side-by-Side Comparison

| Property | Transformer | Mamba | TTT |
| --- | --- | --- | --- |
| State type | KV cache (all past tokens) | Fixed-size matrix | Weight matrix |
| State size | O(n·d) | O(d·N), N ≈ 16 | O(d_inner²) |
| Per-token cost | O(n·d) | O(d·N) | O(d_inner²) |
| Retrieval ability | Exact (any past token) | Approximate (compressed) | Learned (gradient-based) |
| Information capacity | Unbounded (grows with n) | O(d·N) bits | O(d_inner²) bits |
| Long-context scaling | Quadratic cost | Linear cost | Linear cost |
| Parallelizable (training) | Yes | Yes (parallel scan) | Partially |
| In-context learning | Strong (induction heads) | Weak (no content-addressable memory) | Moderate to strong |
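To make the state-size row concrete, here is a back-of-envelope comparison. All dimensions are hypothetical but typical-scale choices (d = 4096, N = 16, d_inner = 2048, a 100K-token context, fp16, a single layer, keys and values counted once each):

```python
# Back-of-envelope state sizes in bytes (fp16), one layer, illustrative dims.
d, N, d_inner, n = 4096, 16, 2048, 100_000   # hypothetical model sizes
bytes_per = 2                                # fp16

kv_cache  = 2 * n * d * bytes_per     # keys + values for n cached tokens
ssm_state = d * N * bytes_per         # fixed-size Mamba state
ttt_state = d_inner ** 2 * bytes_per  # TTT weight matrix

print(f"KV: {kv_cache/1e9:.2f} GB, SSM: {ssm_state/1e3:.0f} KB, TTT: {ttt_state/1e6:.1f} MB")
# → KV: 1.64 GB, SSM: 131 KB, TTT: 8.4 MB
```

The gap is the whole story of the table: the transformer's state grows with n while the other two stay fixed, and TTT spends a few extra megabytes over the SSM to buy a more expressive memory.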

When Each Wins

Transformers win when:

  • Precise retrieval matters (copying, lookup, in-context few-shot learning)
  • Sequence length is moderate (< 8K tokens)
  • You need the strongest possible language modeling quality
  • Hardware supports efficient attention (FlashAttention on modern GPUs)

Mamba wins when:

  • Sequence length is very long (> 32K tokens) and cost must be linear
  • The task is primarily about aggregation, not retrieval (audio, genomics, time series)
  • Inference latency per token must be constant regardless of context

TTT wins when:

  • Contexts are very long AND retrieval of specific information is required
  • The input distribution changes within the sequence (domain shift within a document)
  • You want the model to adapt its processing to the specific input at inference time

The Hybrid Trend

As of 2025-2026, the industry is converging on hybrid architectures:

  • Jamba (AI21): interleaves Mamba layers with attention layers
  • Mamba-2: shows attention and SSMs have a unified mathematical framework
  • TTT layers inside transformers: replacing some attention layers with TTT layers

The likely endpoint: different layers in the same model use different mechanisms depending on what that layer needs to do. Early layers use SSMs for cheap long-range mixing. Middle layers use TTT or attention for precise retrieval. Final layers use attention for generation quality.

Common Confusions

Watch Out

Mamba is not strictly worse than attention

Mamba processes each token in O(d·N) vs. attention's O(n·d). For context lengths where n > N (which is almost always, since N ≈ 16), Mamba is cheaper per token. The tradeoff is retrieval precision, not compute. On tasks that do not require exact retrieval (audio classification, time-series prediction), Mamba matches transformers at much lower cost.

Watch Out

TTT is not just fine-tuning during inference

TTT updates a single small layer's weights using a self-supervised loss. It does not fine-tune the whole model. The update is fast (one gradient step), local (one layer), and unsupervised (no labels needed). It is closer to online learning than to fine-tuning.

References

  • Vaswani et al., "Attention Is All You Need" (2017). The transformer.
  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces" (2023)
  • Sun et al., "Learning to (Learn at Test Time): RNNs with Expressive Hidden States" (ICML 2024)
  • Dao & Gu, "Transformers are SSMs" (2024). The Mamba-2 / unification paper.