Model Timeline
DeepSeek Models
DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, RL-trained reasoning in DeepSeek-R1, the V3.1/V3.2 hybrid reasoning line, and the V4 Preview 1M-context release.
Why This Matters
DeepSeek demonstrated that architectural innovation and careful training can compensate for having less compute than the largest Western labs. DeepSeek-V3 achieves frontier performance with a mixture-of-experts architecture that activates only 37B of 671B total parameters per token. DeepSeek-R1 showed that chain-of-thought reasoning can emerge from reinforcement learning on verifiable tasks. The V3.1 and V3.2 line moved toward hybrid inference: one model can serve ordinary chat, thinking-mode reasoning, and agent-style tool use. V4 Preview (April 2026) extends this with token-wise compression plus DeepSeek Sparse Attention (DSA) and makes a 1M-token context window the default across all official services.
KV cache footprint per sequence across DeepSeek architecture generations (60 layers, 128 heads, d_head = 128, fp16)
The MHA bar at 1M tokens shows why dense attention does not scale: the cache alone would run into multiple terabytes per sequence. MLA already turns 1M-token context into a tractable engineering target; V4's token compression and DeepSeek Sparse Attention extend that margin so 1M context can be the default surface rather than a specialist mode. The V4 factor of 4 shown here is a representative estimate, not a published ratio.
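The chart's numbers can be reproduced with a few lines of arithmetic. This sketch uses the stated configuration (60 layers, 128 heads, d_head = 128, fp16) and the DeepSeek-V2 MLA latent dimensions; the V4 divide-by-4 is the same representative estimate as in the chart, not a published ratio.

```python
# Back-of-envelope KV cache per sequence at 1M tokens for the three
# architecture generations shown above. The V4 factor of 4 is an
# assumed compression ratio, not a number DeepSeek has published.
LAYERS, HEADS, D_HEAD, BYTES = 60, 128, 128, 2  # fp16 = 2 bytes/value
D_LATENT, D_ROPE = 512, 64                      # MLA dims from DeepSeek-V2

def cache_gb(per_token_dims, tokens=1_000_000):
    """KV cache in GB for one sequence: dims/token/layer x layers x bytes."""
    return per_token_dims * LAYERS * BYTES * tokens / 1e9

mha = cache_gb(2 * HEADS * D_HEAD)       # keys + values for every head
mla = cache_gb(D_LATENT + D_ROPE)        # shared latent + RoPE key component
v4 = mla / 4                             # assumed token-wise compression

print(f"MHA @1M tokens: {mha:,.0f} GB")  # ~3,932 GB, i.e. multiple TB
print(f"MLA @1M tokens: {mla:,.0f} GB")  # ~69 GB
print(f"V4 estimate:    {v4:,.0f} GB")
```

The MHA figure lands at roughly 3.9 TB per sequence, which is why dense attention is not a viable 1M-context baseline.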
DeepSeek-V2 (May 2024)
Architecture. MoE transformer with 236B total parameters, 21B active per token. 160 experts with 6 active per token plus 2 shared experts that process every token.
Key innovation: Multi-head Latent Attention (MLA). Standard multi-head attention stores separate key and value vectors for each head in the KV cache, which grows linearly with sequence length and number of heads. MLA compresses the KV cache by projecting keys and values into a shared low-dimensional latent space. This reduces KV cache memory by roughly 93% compared to standard multi-head attention.
Key innovation: DeepSeekMoE. Fine-grained expert segmentation: instead of a few large experts, use many small experts. This allows more precise routing and better expert specialization. Shared experts handle common patterns while routed experts specialize.
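The routing pattern above can be sketched in a few lines. This is an illustrative toy, assuming softmax gating over the selected experts' scores; it is not DeepSeek's exact implementation, and the function names are invented for this example.

```python
# Hypothetical sketch of fine-grained MoE routing with shared experts:
# top-k routed experts are picked per token, shared experts always run.
import math

def route(scores, k=6, num_shared=2):
    """Pick the k highest-scoring routed experts; shared experts always fire."""
    routed = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # normalize gate weights over the selected routed experts only
    exp = [math.exp(scores[i]) for i in routed]
    z = sum(exp)
    gates = {i: e / z for i, e in zip(routed, exp)}
    shared = [f"shared_{j}" for j in range(num_shared)]  # process every token
    return shared, gates

shared, gates = route([0.1, 2.0, -1.0, 0.5, 1.5, 0.3, 0.9, 1.1], k=3)
print(shared)         # ['shared_0', 'shared_1']
print(sorted(gates))  # indices of the 3 highest-scoring experts: [1, 4, 7]
```

Many small experts mean the router chooses among fine-grained specialists, while the always-on shared experts keep common patterns out of the routed capacity.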
Training. 8.1T tokens. Trained on a cluster of Nvidia H800 GPUs (the export-restricted variant of H100, available to Chinese labs).
Result. Competitive with Llama 3 70B and Mixtral 8x22B at significantly lower inference cost due to the low active parameter count.
DeepSeek-V3 (December 2024)
Architecture. MoE transformer with 671B total parameters, 37B active per token. 256 routed experts with 8 active per token, plus 1 shared expert.
Training. 14.8T tokens. Estimated training cost: approximately 5.6 million USD in compute, dramatically lower than comparable frontier models. This cost estimate (from the technical report) generated significant attention because it suggested frontier-quality models could be trained for a fraction of what US labs spend.
Auxiliary-loss-free load balancing. Standard MoE training uses an auxiliary loss to prevent expert collapse (all tokens routing to the same few experts). DeepSeek-V3 introduced a load-balancing method that does not require an auxiliary loss term, reducing interference with the primary training objective.
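The core idea can be sketched as follows: a per-expert bias is added to routing scores for expert *selection* only, and is nudged after each batch so overloaded experts become less likely to be chosen. Gate values still come from the raw scores, so the main training objective is undisturbed. The update rule and constants here are illustrative, not the paper's exact schedule.

```python
# Sketch of bias-based (auxiliary-loss-free) MoE load balancing:
# the bias steers *which* experts are selected, never the gate weights.
def select_experts(scores, bias, k):
    """Top-k selection on biased scores; bias affects routing only."""
    order = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i],
                   reverse=True)
    return order[:k]

def update_bias(bias, loads, gamma=0.001):
    """Penalize experts loaded above the mean; boost the underloaded ones."""
    mean = sum(loads) / len(loads)
    return [b - gamma if l > mean else b + gamma for b, l in zip(bias, loads)]

bias = [0.0, 0.0, 0.0, 0.0]
loads = [900, 50, 30, 20]   # expert 0 is hot this batch
bias = update_bias(bias, loads)
print(bias)                 # expert 0 penalized: [-0.001, 0.001, 0.001, 0.001]
```

Because the bias never enters the forward gate values, there is no gradient pressure fighting the language-modeling loss, which is the interference the auxiliary-loss approach suffers from.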
Multi-token prediction. The model predicts multiple future tokens simultaneously during training, which improves training efficiency and downstream performance.
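A minimal sketch of the objective, assuming auxiliary heads that predict 2, 3, ... tokens ahead and an averaged cross-entropy; DeepSeek-V3's actual MTP module chains sequential prediction heads, so this only illustrates the loss shape.

```python
# Toy multi-token-prediction loss: at one sequence position, depth-d
# heads predict the token (d+1) steps ahead; losses are averaged.
import math

def mtp_loss(probs_per_depth, targets, seq_pos):
    """probs_per_depth[d][token] = head-d's probability of `token`."""
    losses = []
    for d, probs in enumerate(probs_per_depth):
        target = targets[seq_pos + d + 1]    # token (d+1) steps ahead
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)         # average over prediction depths

targets = [3, 1, 4, 1, 5]
heads = [{1: 0.5, 4: 0.5},   # depth-1 head at position 0 (target: 1)
         {4: 0.8, 1: 0.2}]   # depth-2 head at position 0 (target: 4)
print(round(mtp_loss(heads, targets, 0), 4))  # (ln2 + ln1.25)/2 ~ 0.4581
```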
Result. Competitive with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on most benchmarks. Particularly strong on math and coding. The combination of frontier quality, open weights, and low training cost made V3 one of the most discussed model releases of 2024.
DeepSeek-R1 (January 2025)
Architecture. Same base architecture as DeepSeek-V3 (671B/37B MoE).
Key innovation: reasoning via RL. DeepSeek-R1 was trained to reason using reinforcement learning on tasks with verifiable answers (math problems, coding challenges). The model generates long chain-of-thought reasoning traces before producing a final answer. Critically, the reasoning behavior emerged from RL training alone (DeepSeek-R1-Zero), without first training on human-written reasoning traces.
DeepSeek-R1-Zero
A variant trained with pure RL (no supervised fine-tuning on reasoning traces). Given a math or coding problem, the model is rewarded for producing correct final answers. Through RL training, the model spontaneously learned to generate chain-of-thought reasoning, self-check its work, and explore multiple solution paths. This demonstrated that explicit reasoning can emerge from outcome-based RL without human demonstrations of the reasoning process.
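An outcome-based reward of this kind is simple to state in code. The sketch below follows the R1 paper's `<think>`/`<answer>` output convention; the exact scoring weights and answer-extraction logic are illustrative assumptions.

```python
# Hedged sketch of a verifiable-outcome reward: score only whether the
# extracted final answer matches the reference, plus a small format
# reward for a well-formed think/answer trace. Weights are illustrative.
import re

def reward(completion, reference):
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            completion, re.DOTALL))
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = m.group(1).strip() if m else None
    correct = answer == reference
    return (1.0 if correct else 0.0) + (0.1 if fmt_ok else 0.0)

good = "<think>2+2: add the units digits...</think><answer>4</answer>"
print(reward(good, "4"))                   # 1.1: correct + well-formed
print(reward("<answer>5</answer>", "4"))   # 0.0: wrong, no think block
```

Nothing in this signal demonstrates *how* to reason; the chain-of-thought inside `<think>` is shaped only by whether it leads to answers that score well.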
Distilled models. DeepSeek released distilled versions (1.5B to 70B dense models) trained on reasoning traces from R1. These smaller models inherit some of R1's reasoning ability at much lower cost.
Result. Competitive with OpenAI's o1 on math and coding benchmarks (AIME 2024, Codeforces). Open-weight release enabled the research community to study reasoning model internals for the first time.
DeepSeek-R1-0528 (May 2025)
DeepSeek-R1-0528 is a post-training refresh of R1 rather than a new base architecture. DeepSeek's release notes say it still uses the December 2024 DeepSeek-V3 base model, but spends more post-training compute to improve reasoning depth. The public web, app, and API versions kept a 64K context; the open-weight checkpoint supports 128K context.
Why it matters. The update makes the R1 line less of a one-off research release and more of a maintained reasoning product. DeepSeek also distilled R1-0528's reasoning traces into an 8B Qwen3-based model, showing how large reasoning models can become teachers for much cheaper deployments.
DeepSeek-V3.1 (August 2025)
Architecture. Same 671B total / 37B active MoE structure as V3, with 128K context. The Hugging Face model card describes V3.1 as a hybrid model supporting both thinking and non-thinking modes via chat-template changes.
Key changes.
- Hybrid inference: one checkpoint serves fast non-thinking responses and longer thinking-mode reasoning.
- Tool and agent improvements: post-training improved tool calling, search-agent behavior, and multi-step task completion.
- Long-context extension: DeepSeek reports expanded 32K and 128K extension phases on additional long-document data.
Result. V3.1 reframed DeepSeek as more than a cheap frontier-chat model. It is a bridge between the R1 reasoning line and product-facing agent tasks.
DeepSeek-V3.2 and V3.2-Speciale (December 2025)
DeepSeek first released V3.2-Exp in September 2025 to test DeepSeek Sparse Attention (DSA), then released V3.2 and V3.2-Speciale in December 2025. DeepSeek's API docs describe V3.2 as the official successor to V3.2-Exp and V3.2-Speciale as a higher-reasoning variant for community evaluation.
What changed. The V3.2 line keeps the MoE-efficiency story but adds sparse attention and larger synthetic-agent training data. This targets the practical bottleneck in long-context agents: not just answering a hard problem, but holding enough context while calling tools over many steps.
Not shipped: DeepSeek-R2. As of April 24, 2026, DeepSeek's official release notes document R1 updates, V3.1, V3.2-Exp, V3.2/V3.2-Speciale, and the V4 Preview, but not a DeepSeek-R2 release. Treat R2 claims as rumors unless DeepSeek publishes a model card, API release note, or technical report.
DeepSeek-V4 Preview (April 2026)
DeepSeek released the V4 Preview on April 24, 2026 through chat.deepseek.com and the API (compatible with OpenAI ChatCompletions and Anthropic-style endpoints by changing the model parameter).
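The "change the model parameter" claim amounts to building the same OpenAI-style request body with a different model string. A minimal sketch of that payload shape, assuming the model identifier "deepseek-v4" (check DeepSeek's API docs for the real name):

```python
# Hypothetical OpenAI-compatible ChatCompletions payload; switching
# between providers or model variants is just the `model` field.
import json

def chat_request(model, user_msg):
    """Build an OpenAI-style ChatCompletions request body as JSON."""
    body = {
        "model": model,   # e.g. "deepseek-v4" (assumed identifier)
        "messages": [{"role": "user", "content": user_msg}],
    }
    return json.dumps(body)

print(chat_request("deepseek-v4", "Summarize this repo."))
```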
Two variants.
- V4-Pro: 1.6T total parameters, 49B active per token. Positioned as rivaling top closed-source frontier models. DeepSeek reports open-source SOTA on agentic-coding benchmarks and leads open models on world-knowledge evals.
- V4-Flash: 284B total parameters, 13B active per token. Designed as the cheap-and-fast variant while keeping reasoning quality close to V4-Pro.
Key change: 1M context is now the default. Every official DeepSeek surface (web chat, app, API) serves a 1M-token context window by default. V3/V3.1/V3.2 were 128K-default and relied on extended-context checkpoints for long inputs.
Key change: Token-wise compression + DSA. V4 pairs token-wise compression with DeepSeek Sparse Attention (first shipped in V3.2-Exp). The combination targets the two costs of long context simultaneously: KV cache memory and attention FLOPs. Compression reduces the state each cached token occupies; DSA limits how many past tokens each query attends to. At 1M tokens the savings compound.
Inference modes.
- Expert Mode (chat UI): routes to V4-Pro for the hardest queries.
- Instant Mode (chat UI): routes to V4-Flash for low-latency responses.
- Thinking / Non-Thinking (API): the same model checkpoint exposes both modes via chat-template toggles, continuing the hybrid-inference pattern from V3.1.
Why this matters. The V3-line bet on cheap MoE inference (active ≪ total) plus MLA for KV cache. V4 adds sparse attention and token compression and treats long context as a product default rather than a specialist setting. The positioning tracks the broader field: frontier chat, reasoning, and agentic coding all increasingly depend on holding much more context across many tool calls.
Multi-head Latent Attention (MLA)
KV Cache Compression via Latent Projection
Statement
In standard multi-head attention, the KV cache stores n_h · d_h values per token per layer for both keys and values, giving a per-token memory cost of 2 · n_h · d_h per layer. MLA projects all heads into a shared latent vector of dimension d_c, then recovers per-head keys and values via learned up-projections at inference time. The per-token KV cache memory reduces from 2 · n_h · d_h to d_c per layer. For DeepSeek-V2 with n_h = 128, d_h = 128, and d_c = 512, this is a compression ratio of approximately 64×. The full MLA design also caches a small RoPE-decoupled key component of dimension d_R (typically d_R = d_h / 2 = 64), so the actual per-token cache is d_c + d_R rather than d_c alone; the ratio above is a simplification that ignores this term.
Intuition
Standard attention stores separate key and value vectors for each head. But these vectors are often correlated across heads because they derive from the same input representation. MLA exploits this by storing only a compressed representation and reconstructing per-head keys and values on the fly. You trade a small amount of compute (the up-projection) for a large memory saving.
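The mechanics can be shown with toy dimensions. This sketch is illustrative only (random matrices, tiny sizes, no RoPE component); it shows what gets cached and what gets reconstructed, not attention quality.

```python
# Toy MLA mechanics: cache one shared latent per token instead of
# per-head keys/values; re-expand with learned up-projections on the fly.
import random
random.seed(0)

N_HEADS, D_HEAD, D_LATENT, D_MODEL = 4, 8, 6, 32

def rand_mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

W_down = rand_mat(D_LATENT, D_MODEL)                    # hidden -> latent
W_up_k = [rand_mat(D_HEAD, D_LATENT) for _ in range(N_HEADS)]
W_up_v = [rand_mat(D_HEAD, D_LATENT) for _ in range(N_HEADS)]

h = [random.gauss(0, 1) for _ in range(D_MODEL)]        # one token's hidden state
latent = matvec(W_down, h)                              # this is all we cache
k_heads = [matvec(W, latent) for W in W_up_k]           # rebuilt at attention time
v_heads = [matvec(W, latent) for W in W_up_v]

print(len(latent))              # cached dims per token per layer: 6
print(2 * N_HEADS * D_HEAD)     # what MHA would have cached: 64
```

In these toy dimensions the cache shrinks from 64 values per token per layer to 6, at the cost of two extra small matrix multiplications per head when attention runs.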
Proof Sketch
The compression works if the joint key-value representation across heads lies approximately in a low-dimensional subspace. MLA learns this subspace during training. The up-projection matrices are absorbed into the attention computation, so the only additional cost is matrix multiplications with the projection matrices, which is small relative to the memory savings for long sequences.
Why It Matters
KV cache is the primary memory bottleneck for long-context inference with large models. For a 671B parameter model serving sequences of 100K+ tokens, the KV cache can exceed the model weights in memory. MLA makes long-context inference feasible on hardware that would otherwise be insufficient. This is one of the reasons DeepSeek-V2 and V3 can offer competitive long-context performance at lower cost.
Failure Mode
If the key-value representations across heads are not well-approximated by a low-rank structure, compression introduces approximation error that degrades attention quality. In practice, the learned latent space captures most of the relevant information, but there may be tasks where fine-grained per-head information matters and MLA slightly underperforms standard attention. The compression ratio is also fixed at training time; it cannot adapt to the difficulty of individual sequences.
Why DeepSeek Matters for the Field
Cost efficiency. DeepSeek-V3's reported training cost of about 5.6 million USD challenges the assumption that frontier models require hundreds of millions in compute. Even if the true fully-loaded cost is higher (the estimate excludes researcher salaries, failed experiments, and infrastructure), it demonstrates that architectural efficiency (MoE, MLA) can substantially reduce the compute needed for frontier quality.
Open weights for reasoning models. Before DeepSeek-R1, reasoning models (OpenAI o1, o1-pro) were closed. R1's open release let researchers study how reasoning emerges from RL, what the chain-of-thought traces look like internally, and how to distill reasoning into smaller models.
Hardware constraints driving innovation. Chinese labs have restricted access to top-tier Nvidia GPUs (H100 vs. H800). This constraint may have incentivized architectural innovations (MoE, MLA) that reduce compute requirements. Constraints sometimes accelerate innovation.
Hybrid product path. V3.1 and V3.2 show the post-R1 direction: combine ordinary chat, explicit thinking, tool calls, and sparse long-context inference inside one deployable family. That is a different bet from releasing a separate reasoning-only successor.
Common Confusions
671B parameters does not mean 671B inference cost
DeepSeek-V3 has 671B total parameters but only activates 37B per token. The inference FLOPs per token are comparable to a ~37B dense model. However, all 671B parameters must be loaded into GPU memory, so the memory footprint is still large. The efficiency gain is in compute per token, not memory.
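The compute-versus-memory split is easy to make concrete. Rough arithmetic, using the common ~2 FLOPs-per-active-parameter rule of thumb for a forward pass and assuming fp8 (1 byte/param) weight storage:

```python
# Forward FLOPs/token scale with ACTIVE params; GPU memory to hold the
# weights scales with TOTAL params. Rule-of-thumb numbers only.
total_params, active_params = 671e9, 37e9

flops_per_token = 2 * active_params       # compute like a ~37B dense model
weights_gb_fp8 = total_params / 1e9       # 1 byte/param at fp8

print(f"{flops_per_token:.2e} FLOPs/token")   # 7.40e+10
print(f"{weights_gb_fp8:.0f} GB of weights")  # 671 GB just to load the model
```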
R1-Zero reasoning is not the same as prompting for chain-of-thought
When you prompt a standard model with "think step by step", you are using a pattern the model learned during pretraining. DeepSeek-R1-Zero's reasoning emerged from RL training on outcomes. It was never shown examples of reasoning traces. The model discovered that generating intermediate steps helps it get correct answers, purely from the reward signal.
V4 1M context does not mean linear attention over 1M tokens
V4-Pro and V4-Flash default to a 1M-token context window, but they do not attend densely to every prior token. The combination of token-wise compression and DeepSeek Sparse Attention keeps the cost sub-quadratic: each query attends to a sparsely-selected subset of past tokens, and each cached token occupies a compressed representation. Dense 1M-to-1M self-attention at V4's scale would dominate both compute and memory; the product is feasible because of the attention sparsity, not in spite of it.
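The sub-quadratic claim is a counting argument: dense attention computes n² query-key scores, while a pattern that caps each query at k keys computes n·k. The k below is an assumed per-query budget for illustration, not DeepSeek's published DSA parameter.

```python
# Dense vs sparse attention score counts at 1M context (budget k assumed).
n = 1_000_000   # context length
k = 2_048       # assumed per-query key budget under sparse attention

dense_scores = n * n       # every query attends to every prior token
sparse_scores = n * k      # each query attends to at most k selected tokens
print(f"dense / sparse = {dense_scores / sparse_scores:.0f}x")  # ~488x fewer
```

With any fixed budget k, cost grows as O(n·k) rather than O(n²), which is the difference between 1M context as a product default and 1M context as a stunt.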
Exercises
Problem
DeepSeek-V3 has 256 routed experts and activates 8 per token, plus 1 shared expert. What fraction of the routed expert parameters are active for any given token? If total parameters are 671B and shared/non-expert parameters account for roughly 37B, estimate the total routed expert parameters.
Problem
MLA compresses the KV cache from 2 · n_h · d_h dimensions per token per layer to d_c dimensions. For a model with 60 layers, n_h = 128 heads, d_h = 128, sequence length 100K tokens, and d_c = 512, compute the KV cache size in GB for both standard attention and MLA (using float16, 2 bytes per value).
Problem
V4 makes 1M-token context the default. Suppose a naive dense-attention model had the same per-token MLA KV cache cost as in the previous exercise (about 61 KB of cache per token per model). At 1M tokens, that cache alone would be about 61 GB per sequence. But the attention compute for that same dense model scales as O(n²) in the sequence length n. If DeepSeek Sparse Attention (DSA) restricts each query to at most k keys and token-wise compression reduces the effective cached state by a factor of 4, what are the resulting per-token KV-cache footprint and the attention compute scaling? Assume compute is dominated by the query-to-key dot products per layer.
References
Canonical:
- DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024, arXiv:2405.04434)
- DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024, arXiv:2412.19437)
- DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025, arXiv:2501.12948)
Current:
- Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (2024, arXiv:2401.06066): fine-grained expert segmentation and shared-expert design underlying V2/V3.
- Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024, arXiv:2402.03300): introduces GRPO (group relative policy optimization), the RL algorithm used in R1.
- Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (2024, arXiv:2408.15664): the bias-term balancing scheme used in V3.
- DeepSeek API Docs, "DeepSeek-R1 update" (May 28, 2025), "DeepSeek-V3.1 Release" (Aug 21, 2025), "Introducing DeepSeek-V3.2-Exp" (Sep 29, 2025), "DeepSeek-V3.2 Release" (Dec 1, 2025), and "DeepSeek-V4 Preview" (April 24, 2026, api-docs.deepseek.com/news/news260424)
- DeepSeek-AI, DeepSeek-V3.1 model card, Hugging Face (2025), https://huggingface.co/deepseek-ai/DeepSeek-V3.1
Next Topics
- Model comparison table: structured comparison across frontier model families
Last reviewed: April 24, 2026
Canonical graph
Required before and derived from this topic
These links come from prerequisite edges in the curriculum graph. Editorial suggestions are shown here only when the target page also cites this page as a prerequisite.
Required prerequisites
- Mixture of Experts (layer 4 · tier 2)
- Transformer Architecture (layer 4 · tier 2)
Derived topics
- Model Comparison Table (layer 5 · tier 2)