
DeepSeek Models

DeepSeek's model family: MoE architectures with Multi-head Latent Attention, fine-grained expert routing, RL-trained reasoning in DeepSeek-R1, the V3.1/V3.2 hybrid reasoning line, and the V4 Preview 1M-context release.


Why This Matters

DeepSeek demonstrated that architectural innovation and careful training can compensate for having less compute than the largest Western labs. DeepSeek-V3 achieves frontier performance with a mixture-of-experts architecture that activates only 37B of 671B total parameters per token. DeepSeek-R1 showed that chain-of-thought reasoning can emerge from reinforcement learning on verifiable tasks. The V3.1 and V3.2 line moved toward hybrid inference: one model can serve ordinary chat, thinking-mode reasoning, and agent-style tool use. V4 Preview (April 2026) extends this with token-wise compression plus DeepSeek Sparse Attention (DSA) and makes a 1M-token context window the default across all official services.

KV cache footprint per sequence across DeepSeek architecture generations (60 layers, 128 heads, d_head = 128, fp16)

Context length | Standard MHA | MLA (V2-V3.2) | V4 (MLA + DSA + compression)
32K            | 126 GB       | 2 GB          | 553 MB
128K           | 503 GB       | 9 GB          | 2 GB
1M             | 3.9 TB       | 69 GB         | 17 GB

The MHA bar at 1M tokens shows why dense attention does not scale: the cache alone would run into multiple terabytes per sequence. MLA already turns 1M-token context into a tractable engineering target; V4's token compression and DeepSeek Sparse Attention extend that margin so 1M context can be the default surface rather than a specialist mode. The V4 factor of 4 shown here is a representative estimate, not a published ratio.
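The table values follow directly from the per-token cache formulas, so they are easy to reproduce. The short script below recomputes them, assuming the MLA cache stores 512 + 64 values per token per layer (the latent dimension plus the RoPE-decoupled key discussed in the MLA section below) and reusing the same representative factor of 4 for V4.

```python
# KV cache per sequence, using the chart's assumptions:
# 60 layers, 128 heads, d_head = 128, fp16 (2 bytes per value).
BYTES = 2
LAYERS, HEADS, D_HEAD = 60, 128, 128
D_LATENT, D_ROPE = 512, 64          # MLA latent dim + RoPE-decoupled key dim (V2/V3)
V4_COMPRESSION = 4                  # representative estimate, not a published ratio

def mha_cache(tokens):
    # keys + values for every head, every layer
    return tokens * LAYERS * 2 * HEADS * D_HEAD * BYTES

def mla_cache(tokens):
    # one shared latent vector plus a small RoPE key component per token per layer
    return tokens * LAYERS * (D_LATENT + D_ROPE) * BYTES

def v4_cache(tokens):
    return mla_cache(tokens) / V4_COMPRESSION

for n in (32_000, 128_000, 1_000_000):
    print(f"{n:>9} tokens: MHA {mha_cache(n)/1e9:7.1f} GB | "
          f"MLA {mla_cache(n)/1e9:5.1f} GB | V4 ~{v4_cache(n)/1e9:5.1f} GB")
```

The printed values match the table above to within rounding.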

DeepSeek-V2 (May 2024)

Architecture. MoE transformer with 236B total parameters, 21B active per token. 160 routed experts with 6 active per token, plus 2 shared experts that process every token.

Key innovation: Multi-head Latent Attention (MLA). Standard multi-head attention stores separate key and value vectors for each head in the KV cache, which grows linearly with sequence length and number of heads. MLA compresses the KV cache by projecting keys and values into a shared low-dimensional latent space. This reduces KV cache memory by roughly 93% compared to standard multi-head attention.

Key innovation: DeepSeekMoE. Fine-grained expert segmentation: instead of a few large experts, use many small experts. This allows more precise routing and better expert specialization. Shared experts handle common patterns while routed experts specialize.
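A minimal sketch of the routing idea, with made-up layer sizes: every token passes through the shared experts, while a router picks a small top-k subset of many small routed experts. This omits load balancing, expert parallelism, and the actual V2/V3 gating details.

```python
import torch
import torch.nn as nn

class FineGrainedMoE(nn.Module):
    """Toy DeepSeekMoE-style layer: shared experts + top-k of many small routed experts."""
    def __init__(self, d_model=512, d_expert=128, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.top_k = top_k
        make_expert = lambda: nn.Sequential(
            nn.Linear(d_model, d_expert), nn.SiLU(), nn.Linear(d_expert, d_model))
        self.shared = nn.ModuleList(make_expert() for _ in range(n_shared))
        self.routed = nn.ModuleList(make_expert() for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):                              # x: (tokens, d_model)
        shared_out = sum(e(x) for e in self.shared)    # shared experts see every token
        scores = self.router(x).softmax(dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1) # (tokens, top_k)
        routed_out = torch.stack([                      # naive per-token loop for clarity
            sum(w * self.routed[int(i)](x[t]) for w, i in zip(weights[t], idx[t]))
            for t in range(x.size(0))])
        return shared_out + routed_out

tokens = torch.randn(4, 512)
print(FineGrainedMoE()(tokens).shape)                  # torch.Size([4, 512])
```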

Training. 8.1T tokens. Trained on a cluster of Nvidia H800 GPUs (the export-restricted variant of H100, available to Chinese labs).

Result. Competitive with Llama 3 70B and Mixtral 8x22B at significantly lower inference cost due to the low active parameter count.

DeepSeek-V3 (December 2024)

Architecture. MoE transformer with 671B total parameters, 37B active per token. 256 routed experts with 8 active per token, plus 1 shared expert.

Training. 14.8T tokens. Estimated training cost: approximately 5.6 million USD in compute, dramatically lower than comparable frontier models. This cost estimate (from the technical report) generated significant attention because it suggested frontier-quality models could be trained for a fraction of what US labs spend.

Auxiliary-loss-free load balancing. Standard MoE training uses an auxiliary loss to prevent expert collapse (all tokens routing to the same few experts). DeepSeek-V3 introduced a load-balancing method that does not require an auxiliary loss term, reducing interference with the primary training objective.
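The V3 report and the Wang et al. paper in the references describe adding a per-expert bias to the routing scores that is used only for top-k selection (not for the gate weights), then nudging that bias up for under-loaded experts and down for over-loaded ones after each step. A rough sketch of that rule, with an assumed step size:

```python
import torch

def select_experts(scores, bias, top_k):
    """Pick experts by biased score, but weight outputs by the unbiased score."""
    _, idx = (scores + bias).topk(top_k, dim=-1)     # bias affects selection only
    gate = torch.gather(scores, -1, idx)             # gate weights stay unbiased
    return idx, gate

def update_bias(bias, idx, n_experts, gamma=1e-3):
    """Aux-loss-free balancing: raise bias of under-loaded experts, lower over-loaded ones."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias + gamma * torch.sign(load.mean() - load)   # assumed update form

n_experts, top_k = 16, 2
scores = torch.rand(32, n_experts).softmax(-1)       # router scores for 32 tokens
bias = torch.zeros(n_experts)
idx, gate = select_experts(scores, bias, top_k)
bias = update_bias(bias, idx, n_experts)
print(bias)
```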

Multi-token prediction. The model predicts multiple future tokens simultaneously during training, which improves training efficiency and downstream performance.
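A simplified illustration of where the extra training signal comes from: predict tokens t+1 through t+K from the same hidden state and sum the losses. DeepSeek-V3's actual MTP design uses sequential prediction modules rather than the parallel heads assumed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified multi-token prediction: extra heads predict tokens t+1 .. t+K.
vocab, d_model, K = 1000, 256, 2
heads = nn.ModuleList(nn.Linear(d_model, vocab) for _ in range(K))

def mtp_loss(hidden, targets):
    """hidden: (batch, seq, d_model); targets: (batch, seq) token ids."""
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-k])                 # positions that have a t+k target
        loss = loss + F.cross_entropy(
            logits.reshape(-1, vocab), targets[:, k:].reshape(-1))
    return loss / K

hidden = torch.randn(2, 16, d_model)
targets = torch.randint(0, vocab, (2, 16))
print(mtp_loss(hidden, targets))
```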

Result. Competitive with GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro on most benchmarks. Particularly strong on math and coding. The combination of frontier quality, open weights, and low training cost made V3 one of the most discussed model releases of 2024.

DeepSeek-R1 (January 2025)

Architecture. Same base architecture as DeepSeek-V3 (671B/37B MoE).

Key innovation: reasoning via RL. DeepSeek-R1 was trained to reason using reinforcement learning on tasks with verifiable answers (math problems, coding challenges). The model generates long chain-of-thought reasoning traces before producing a final answer. Critically, the reasoning behavior emerged from RL training alone (DeepSeek-R1-Zero), without first training on human-written reasoning traces.

Definition

DeepSeek-R1-Zero

A variant trained with pure RL (no supervised fine-tuning on reasoning traces). Given a math or coding problem, the model is rewarded for producing correct final answers. Through RL training, the model spontaneously learned to generate chain-of-thought reasoning, self-check its work, and explore multiple solution paths. This demonstrated that explicit reasoning can emerge from outcome-based RL without human demonstrations of the reasoning process.
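The reward in this setup is just a check on the final answer. The references list DeepSeekMath's GRPO as the RL algorithm behind R1; a minimal sketch of the outcome reward and the group-relative advantage it feeds (sampling, the KL penalty, and the policy update are omitted, and the `#### answer` convention is only an illustrative output format):

```python
import re
import statistics

def outcome_reward(completion: str, reference_answer: str) -> float:
    """Verifiable reward: 1.0 if the final marked answer matches, else 0.0."""
    match = re.search(r"####\s*(.+)\s*$", completion.strip())
    return 1.0 if match and match.group(1).strip() == reference_answer else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: standardize rewards within a group of samples
    drawn for the same prompt, so no learned value function is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0        # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one math prompt, scored against the reference "42".
completions = ["... #### 42", "... #### 41", "reasoning ... #### 42", "#### 7"]
rewards = [outcome_reward(c, "42") for c in completions]
print(rewards)                                     # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))          # [1.0, -1.0, 1.0, -1.0]
```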

Distilled models. DeepSeek released distilled versions (1.5B to 70B dense models) trained on reasoning traces from R1. These smaller models inherit some of R1's reasoning ability at much lower cost.

Result. Competitive with OpenAI's o1 on math and coding benchmarks (AIME 2024, Codeforces). Open-weight release enabled the research community to study reasoning model internals for the first time.

DeepSeek-R1-0528 (May 2025)

DeepSeek-R1-0528 is a post-training refresh of R1 rather than a new base architecture. DeepSeek's release notes say it still uses the December 2024 DeepSeek-V3 base model, but spends more post-training compute to improve reasoning depth. The public web, app, and API versions kept a 64K context; the open-weight checkpoint supports 128K context.

Why it matters. The update makes the R1 line less of a one-off research release and more of a maintained reasoning product. DeepSeek also distilled R1-0528 reasoning traces into an 8B model built on a Qwen3 base, showing how large reasoning models can become teachers for much cheaper deployments.

DeepSeek-V3.1 (August 2025)

Architecture. Same 671B total / 37B active MoE structure as V3, with 128K context. The Hugging Face model card describes V3.1 as a hybrid model supporting both thinking and non-thinking modes via chat-template changes.

Key changes.

  • Hybrid inference: one checkpoint serves fast non-thinking responses and longer thinking-mode reasoning.
  • Tool and agent improvements: post-training improved tool calling, search-agent behavior, and multi-step task completion.
  • Long-context extension: DeepSeek reports expanded 32K and 128K long-context extension phases trained on additional long-document data.

Result. V3.1 reframed DeepSeek as more than a cheap frontier-chat model. It is a bridge between the R1 reasoning line and product-facing agent tasks.

DeepSeek-V3.2 and V3.2-Speciale (December 2025)

DeepSeek first released V3.2-Exp in September 2025 to test DeepSeek Sparse Attention (DSA), then released V3.2 and V3.2-Speciale in December 2025. DeepSeek's API docs describe V3.2 as the official successor to V3.2-Exp and V3.2-Speciale as a higher-reasoning variant for community evaluation.

What changed. The V3.2 line keeps the MoE-efficiency story but adds sparse attention and larger synthetic-agent training data. This targets the practical bottleneck in long-context agents: not just answering a hard problem, but holding enough context while calling tools over many steps.

Not shipped: DeepSeek-R2. As of April 24, 2026, DeepSeek's official release notes document R1 updates, V3.1, V3.2-Exp, V3.2/V3.2-Speciale, and the V4 Preview, but not a DeepSeek-R2 release. Treat R2 claims as rumors unless DeepSeek publishes a model card, API release note, or technical report.

DeepSeek-V4 Preview (April 2026)

DeepSeek released the V4 Preview on April 24, 2026 through chat.deepseek.com and the API (compatible with OpenAI ChatCompletions and Anthropic-style endpoints by changing the model parameter).
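Since the endpoint is described as OpenAI-ChatCompletions compatible, a call might look like the sketch below. The base URL, model identifier, and any mode toggles are assumptions to verify against DeepSeek's current API docs.

```python
from openai import OpenAI

# Assumed values: check api-docs.deepseek.com for the current base URL and model names.
client = OpenAI(api_key="YOUR_DEEPSEEK_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",   # assumed identifier; the docs map model names to V4 variants
    messages=[{"role": "user",
               "content": "Summarize the MLA KV-cache trick in two sentences."}],
)
print(response.choices[0].message.content)
```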

Two variants.

  • V4-Pro: 1.6T total parameters, 49B active per token. Positioned as rivaling top closed-source frontier models. DeepSeek reports open-source SOTA on agentic-coding benchmarks and leads open models on world-knowledge evals.
  • V4-Flash: 284B total parameters, 13B active per token. Designed as the cheap-and-fast variant while keeping reasoning quality close to V4-Pro.

Key change: 1M context is now the default. Every official DeepSeek surface (web chat, app, API) serves a 1M-token context window by default. V3/V3.1/V3.2 were 128K-default and relied on extended-context checkpoints for long inputs.

Key change: Token-wise compression + DSA. V4 pairs token-wise compression with DeepSeek Sparse Attention (first shipped in V3.2-Exp). The combination targets the two costs of long context simultaneously: KV cache memory and attention FLOPs. Compression reduces the state each cached token occupies; DSA limits how many past tokens each query attends to. At 1M tokens the savings compound.
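A toy illustration of the sparsity pattern (not DeepSeek's published DSA kernel): each query scores all cached keys with a cheap pass, keeps only the top-k, and runs full attention on that subset. The raw dot-product selection here is a stand-in for whatever lightweight indexer DSA actually uses.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=4096):
    """q: (d,), k/v: (n_cached, d). Attend to only the top-k_keep keys for this query."""
    k_keep = min(k_keep, k.size(0))
    scores = k @ q                                   # cheap selection pass over all keys
    idx = scores.topk(k_keep).indices                # keep the k_keep most relevant positions
    attn = F.softmax(k[idx] @ q / q.size(0) ** 0.5, dim=-1)
    return attn @ v[idx]                             # weighted sum over the kept values

q = torch.randn(128)
k = torch.randn(100_000, 128)
v = torch.randn(100_000, 128)
out = topk_sparse_attention(q, k, v)                 # attention over 4096 keys, not 100k
print(out.shape)                                     # torch.Size([128])
```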

Inference modes.

  • Expert Mode (chat UI): routes to V4-Pro for the hardest queries.
  • Instant Mode (chat UI): routes to V4-Flash for low-latency responses.
  • Thinking / Non-Thinking (API): the same model checkpoint exposes both modes via chat-template toggles, continuing the hybrid-inference pattern from V3.1.

Why this matters. The V3-line bet on cheap MoE inference (active ≪ total) plus MLA to shrink the KV cache. V4 adds sparse attention and token compression and treats long context as a product default rather than a specialist setting. The positioning tracks the broader field: frontier chat, reasoning, and agentic coding all increasingly depend on holding much more context across many tool calls.

Multi-head Latent Attention (MLA)

Proposition

KV Cache Compression via Latent Projection

Statement

In standard multi-head attention, the KV cache stores $h \cdot d_h$ values per token per layer for both keys and values, giving a per-token memory cost of $2 \cdot h \cdot d_h$ per layer. MLA projects all heads into a shared latent vector of dimension $d_c \ll h \cdot d_h$, then recovers per-head keys and values via learned up-projections at inference time. The per-token KV cache memory reduces from $2 \cdot h \cdot d_h$ to $d_c$ per layer. For DeepSeek-V2 with $h = 128$, $d_h = 128$, and $d_c = 512$, this is a compression ratio of approximately $\frac{2 \times 128 \times 128}{512} = 64\times$. The full MLA design also caches a small RoPE-decoupled key component of dimension $d_h^R$ (typically $64$), so the actual per-token cache is $d_c + d_h^R$ rather than $d_c$ alone; the $64\times$ ratio above is a simplification that ignores this term.

Intuition

Standard attention stores separate key and value vectors for each head. But these vectors are often correlated across heads because they derive from the same input representation. MLA exploits this by storing only a compressed representation and reconstructing per-head keys and values on the fly. You trade a small amount of compute (the up-projection) for a large memory saving.
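A stripped-down sketch of that trade (ignoring the RoPE-decoupled key path and the trick of absorbing the up-projections into the attention matmuls): cache only the small latent vector, and rebuild per-head keys and values from it when attending.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy MLA-style cache: store a d_c latent per token, expand to per-head K/V on use."""
    def __init__(self, d_model=4096, n_heads=128, d_head=128, d_latent=512):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress before caching
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rebuild keys on the fly
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # rebuild values on the fly
        self.n_heads, self.d_head = n_heads, d_head

    def cache(self, hidden):                 # hidden: (seq, d_model)
        return self.down(hidden)             # only (seq, d_latent) is kept in memory

    def expand(self, latent):                # latent: (seq, d_latent)
        seq = latent.size(0)
        k = self.up_k(latent).view(seq, self.n_heads, self.d_head)
        v = self.up_v(latent).view(seq, self.n_heads, self.d_head)
        return k, v

mla = LatentKVCache()
latent = mla.cache(torch.randn(16, 4096))    # 512 values cached per token vs 2*128*128 = 32768
k, v = mla.expand(latent)
print(latent.shape, k.shape)                 # torch.Size([16, 512]) torch.Size([16, 128, 128])
```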

Proof Sketch

The compression works if the joint key-value representation across heads lies approximately in a low-dimensional subspace. MLA learns this subspace during training. The up-projection matrices are absorbed into the attention computation, so the only additional cost is matrix multiplications with the projection matrices, which is small relative to the memory savings for long sequences.

Why It Matters

KV cache is the primary memory bottleneck for long-context inference with large models. For a 671B parameter model serving sequences of 100K+ tokens, the KV cache can exceed the model weights in memory. MLA makes long-context inference feasible on hardware that would otherwise be insufficient. This is one of the reasons DeepSeek-V2 and V3 can offer competitive long-context performance at lower cost.

Failure Mode

If the key-value representations across heads are not well-approximated by a low-rank structure, compression introduces approximation error that degrades attention quality. In practice, the learned latent space captures most of the relevant information, but there may be tasks where fine-grained per-head information matters and MLA slightly underperforms standard attention. The compression ratio is also fixed at training time; it cannot adapt to the difficulty of individual sequences.

Why DeepSeek Matters for the Field

Cost efficiency. DeepSeek-V3's reported training cost of about 5.6 million USD challenges the assumption that frontier models require hundreds of millions in compute. Even if the true fully-loaded cost is higher (the estimate excludes researcher salaries, failed experiments, and infrastructure), it demonstrates that architectural efficiency (MoE, MLA) can substantially reduce the compute needed for frontier quality.

Open weights for reasoning models. Before DeepSeek-R1, reasoning models (OpenAI o1, o1-pro) were closed. R1's open release let researchers study how reasoning emerges from RL, what the chain-of-thought traces look like internally, and how to distill reasoning into smaller models.

Hardware constraints driving innovation. Chinese labs have restricted access to top-tier Nvidia GPUs (H100 vs. H800). This constraint may have incentivized architectural innovations (MoE, MLA) that reduce compute requirements. Constraints sometimes accelerate innovation.

Hybrid product path. V3.1 and V3.2 show the post-R1 direction: combine ordinary chat, explicit thinking, tool calls, and sparse long-context inference inside one deployable family. That is a different bet from releasing a separate reasoning-only successor.

Common Confusions

Watch Out

671B parameters does not mean 671B inference cost

DeepSeek-V3 has 671B total parameters but only activates 37B per token. The inference FLOPs per token are comparable to a ~37B dense model. However, all 671B parameters must be loaded into GPU memory, so the memory footprint is still large. The efficiency gain is in compute per token, not memory.

Watch Out

R1-Zero reasoning is not the same as prompting for chain-of-thought

When you prompt a standard model with "think step by step", you are using a pattern the model learned during pretraining. DeepSeek-R1-Zero's reasoning emerged from RL training on outcomes. It was never shown examples of reasoning traces. The model discovered that generating intermediate steps helps it get correct answers, purely from the reward signal.

Watch Out

V4 1M context does not mean linear attention over 1M tokens

V4-Pro and V4-Flash default to a 1M-token context window, but they do not attend densely to every prior token. The combination of token-wise compression and DeepSeek Sparse Attention keeps the cost sub-quadratic: each query attends to a sparsely-selected subset of past tokens, and each cached token occupies a compressed representation. Dense 1M-to-1M self-attention at V4's scale would dominate both compute and memory; the product is feasible because of the attention sparsity, not in spite of it.

Exercises

ExerciseCore

Problem

DeepSeek-V3 has 256 routed experts and activates 8 per token, plus 1 shared expert. What fraction of the routed expert parameters are active for any given token? If total parameters are 671B and shared/non-expert parameters account for roughly 37B, estimate the total routed expert parameters.

ExerciseAdvanced

Problem

MLA compresses the KV cache from $2 \cdot h \cdot d_h$ dimensions per token per layer to $d_c$ dimensions. For a model with 60 layers, $h = 128$ heads, $d_h = 128$, sequence length 100K tokens, and $d_c = 512$, compute the KV cache size in GB for both standard attention and MLA (using float16, 2 bytes per value).

ExerciseAdvanced

Problem

V4 makes 1M-token context the default. Suppose a model has the same per-token MLA KV cache cost as in the previous exercise (about 61 KB of cache per token across all layers). At 1M tokens, that cache alone would be about 61 GB per sequence, and dense attention compute would scale as $O(n^2)$ in the sequence length $n$. If DeepSeek Sparse Attention (DSA) restricts each query to at most $k = 4096$ keys and token-wise compression reduces the effective cached state by a factor of 4, what are the resulting per-token KV-cache footprint and the attention compute scaling? Assume compute is dominated by the query-to-key dot products, i.e. $n \cdot k$ per layer.

References

Canonical:

  • DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model" (2024, arXiv:2405.04434)
  • DeepSeek-AI, "DeepSeek-V3 Technical Report" (2024, arXiv:2412.19437)
  • DeepSeek-AI, "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (2025, arXiv:2501.12948)

Current:

  • Dai et al., "DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models" (2024, arXiv:2401.06066): fine-grained expert segmentation and shared-expert design underlying V2/V3.
  • Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024, arXiv:2402.03300): introduces GRPO (group relative policy optimization), the RL algorithm used in R1.
  • Wang et al., "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (2024, arXiv:2408.15664): the bias-term balancing scheme used in V3.
  • DeepSeek API Docs, "DeepSeek-R1 update" (May 28, 2025), "DeepSeek-V3.1 Release" (Aug 21, 2025), "Introducing DeepSeek-V3.2-Exp" (Sep 29, 2025), "DeepSeek-V3.2 Release" (Dec 1, 2025), and "DeepSeek-V4 Preview" (April 24, 2026, api-docs.deepseek.com/news/news260424).
  • DeepSeek-AI, DeepSeek-V3.1 model card, Hugging Face (2025), https://huggingface.co/deepseek-ai/DeepSeek-V3.1
