
Paper breakdown

Kimi K2: Open Agentic Intelligence

Kimi Team · 2025 · arXiv preprint (technical report)

1.04 trillion total / 32 billion activated parameter Mixture-of-Experts LLM trained on 15.5T tokens with the MuonClip optimizer (Muon plus per-head QK-Clip post-update weight rescaling). Pairs ultra-sparse MoE with Multi-head Latent Attention (MLA, from DeepSeek-V2/V3) and a synthetic agentic data pipeline. Top open-weight model on Tau2-bench, ACEBench, and SWE-bench Verified at release.

Overview

Moonshot AI's Kimi K2 (2025) is a 1.04 trillion-parameter Mixture-of-Experts (MoE) language model with 32 billion activated parameters, pre-trained on 15.5 trillion tokens with no observed loss spike across the entire run. The technical report's three engineering contributions are:

  1. MuonClip, an optimizer that combines the Muon update (Newton-Schulz-orthogonalized gradient with RMS rescaling) with a post-update weight clip on attention query and key projections (QK-Clip) to prevent the attention-logit growth that destabilizes Muon at trillion-parameter scale.
  2. A synthetic agentic data pipeline that produces roughly 20,000 tools, thousands of agents, multi-turn rubric-evaluated trajectories, and a hybrid simulated-plus-real-execution sandbox, used both for SFT and as the source of verifiable rewards.
  3. A self-critique rubric reward that extends RL-from-verifiable-reward (RLVR) to subjective tasks by having the model rank its own outputs against curated rubrics in a closed loop with the policy.

Architecturally, K2 follows the DeepSeek-V3 template but pushes sparsity further: 384 experts (versus 256 in DeepSeek-V3) with the same 8 active per token, half the attention heads (64 versus 128), and only 1 dense layer. The model uses Multi-head Latent Attention (MLA) for KV-cache compression. The result is the top open-source non-thinking model on Tau2-bench (66.1), ACEBench-En (76.5), and SWE-bench Verified (65.8) at release, and the fifth-overall entry on the LMSYS Arena leaderboard across more than 3,000 user votes.

The paper is a technical report, not a methodologically novel contribution. Its weight is in the demonstrated stability of the training run at trillion-parameter MoE scale and in the public release of base and post-trained checkpoints. Most prior open-weight models at this scale (DeepSeek-V3, Llama 4-class) have been AdamW-trained; Kimi K2 is the first open release showing that Muon-family optimization can be made stable at this scale with a small post-hoc fix.

Mathematical Contributions

Architecture: ultra-sparse MoE with MLA

K2 has 61 transformer layers, 1 dense and 60 MoE. Each MoE layer has 384 experts plus 1 shared expert; the gating function selects 8 experts per token, giving sparsity ratio

$$s = \frac{\text{total experts}}{\text{active experts}} = \frac{384}{8} = 48.$$

For a fixed activated-parameter budget, the paper's sparsity scaling law (their Section 2.3 and Figure 5) shows validation loss dropping monotonically as sparsity grows: at a validation loss of 1.5, sparsity-48 uses 1.69× fewer FLOPs than sparsity-8. The cost is more complex routing infrastructure and a bigger memory footprint, both worth it at this scale.
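To make the routing arithmetic concrete, here is a minimal top-k gating sketch in NumPy. The function names, the toy dimensions, and the softmax-over-selected-logits convention are illustrative assumptions, not K2's exact router:

```python
import numpy as np

def topk_route(h, W_gate, k=8):
    """Top-k expert routing (illustrative sketch, not K2's exact router).

    h:      (tokens, d_model) hidden states
    W_gate: (d_model, n_experts) router weights
    Returns expert indices and normalized gate weights, (tokens, k) each.
    """
    logits = h @ W_gate                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k largest logits
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over selected
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 16))         # 4 tokens, toy d_model = 16
W_gate = rng.standard_normal((16, 384))  # 384 experts, as in K2
idx, w = topk_route(h, W_gate)           # each token routes to 8 of 384 experts
```

Only the 8 selected experts run their FFN for a given token, which is where the sparsity ratio of 48 comes from.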

The attention is Multi-head Latent Attention (MLA, from DeepSeek-V2/V3), which factorizes the key and value projections through a low-rank latent: $K = W^{KU} c^{KV}$, where $c^{KV} = W^{KD} h$ is a roughly 512-dimensional latent vector cached per token instead of the full keys and values. This gives roughly 13× KV-cache compression versus full multi-head attention at the same head dimension. K2 uses 64 attention heads (DeepSeek-V3 used 128); the paper's Figure 6 ablation shows that doubling heads from 64 to 128 buys only a 0.5–1.2% validation-loss improvement at the cost of an 83% increase in inference FLOPs at 128k-token context. The head count is reduced for inference efficiency, especially in long-context agentic workloads.

The Muon update

Muon (Jordan et al., 2024) replaces AdamW's diagonal preconditioner with a Newton-Schulz orthogonalization of the momentum-smoothed gradient. For each weight matrix $\mathbf W \in \mathbb{R}^{n\times m}$:

$$\mathbf M_t = \mu \mathbf M_{t-1} + \mathbf G_t$$

$$\mathbf O_t = 0.2\,\sqrt{\max(n, m)}\;\text{NewtonSchulz}(\mathbf M_t)$$

$$\mathbf W_t = \mathbf W_{t-1} - \eta\,(\mathbf O_t + \lambda \mathbf W_{t-1}).$$

The orthogonalization $\text{NewtonSchulz}(\mathbf M_t) \approx (\mathbf M_t \mathbf M_t^\top)^{-1/2} \mathbf M_t$ projects the update onto the Stiefel manifold of (semi-)orthogonal matrices, and the $0.2\,\sqrt{\max(n,m)}$ factor is chosen so that $\|\mathbf O_t\|_{\text{RMS}}$ matches what AdamW would produce at the same point. The Newton-Schulz iteration is a polynomial in $\mathbf M_t \mathbf M_t^\top$ that converges in about 5 steps to the orthogonalized matrix without an SVD. Empirically, at constant compute and data, Muon substantially outperforms AdamW on token efficiency for standard transformer training.
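A minimal NumPy sketch of the update follows. The quintic coefficients are taken from Jordan et al.'s public reference implementation, and the Frobenius-norm normalization and hyperparameter values are assumptions of this sketch:

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Approximate (M M^T)^(-1/2) M with a quintic Newton-Schulz iteration.

    Coefficients follow Jordan et al.'s public reference implementation
    (an assumption of this sketch); no SVD is needed.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + 1e-7)   # normalize so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, G, M, mu=0.95, lr=0.02, wd=0.01):
    """One Muon update: momentum, orthogonalize, RMS-rescale, decayed step."""
    M = mu * M + G
    O = 0.2 * np.sqrt(max(W.shape)) * newton_schulz(M)
    W = W - lr * (O + wd * W)
    return W, M
```

The iteration drives all singular values of the update toward 1, so every direction in the gradient's row/column space gets a comparable step size regardless of its raw magnitude.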

Why scaling Muon is hard: attention-logit explosion

The paper documents an instability that appears more frequently with Muon than with AdamW: per-head max attention logits grow without bound during training. In a mid-scale experiment (9B activated, 53B total) with vanilla Muon, the per-head max logit

$$S^h_{\max} = \frac{1}{\sqrt{d}}\,\max_{\mathbf X \in B}\,\max_{i,j}\,\mathbf q_i^h (\mathbf k_j^h)^\top$$

crossed 1,000 within roughly 1,000 steps and triggered loss spikes (paper Figure 2, left). Logit soft-cap (Gemma's fix) caps the logit directly but allows the underlying $\mathbf Q \mathbf K^\top$ products to keep growing; Query-Key Normalization (QK-Norm) normalizes the keys but is incompatible with MLA, because the full key matrix is never materialized.
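The monitoring statistic is cheap to compute from pre-softmax logits. A NumPy sketch (the array layout is an assumption of this sketch):

```python
import numpy as np

def max_attention_logit(Q, K, d):
    """Per-head max pre-softmax logit S^h_max over a batch B.

    Q, K: (batch, heads, seq, d) query/key activations for one layer.
    Returns a (heads,) vector: the statistic QK-Clip compares against tau.
    """
    logits = np.einsum("bhid,bhjd->bhij", Q, K) / np.sqrt(d)
    return logits.max(axis=(0, 2, 3))

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8, 16))   # batch=2, heads=4, seq=8, d=16
s_max = max_attention_logit(Q, Q, d=16)  # (4,) per-head maxima
```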

QK-Clip: post-update weight rescaling

The contribution is QK-Clip: after each optimizer step, for any head $h$ whose forward-pass max logit exceeded a threshold $\tau$ (the paper uses $\tau = 100$), rescale the query and key projection weights to bring it back below $\tau$ on the next forward pass.

Per-head scale factor:

$$\gamma_h = \min(1, \tau / S^h_{\max}).$$

For a regular multi-head attention layer:

$$\mathbf W^h_q \leftarrow \gamma_h^\alpha\,\mathbf W^h_q, \qquad \mathbf W^h_k \leftarrow \gamma_h^{1-\alpha}\,\mathbf W^h_k$$

with $\alpha = 0.5$ by default (split the rescaling symmetrically between $\mathbf Q$ and $\mathbf K$). For MLA, the head-shared rotary key is left untouched to avoid coupling heads; the head-specific components $\mathbf q^C, \mathbf k^C$ are each scaled by $\sqrt{\gamma_h}$, and the head-specific rotary $\mathbf q^R$ is scaled by $\gamma_h$.

Crucially, QK-Clip does not alter the forward pass that already happened: the gradient on this step is computed against the unclipped weights. The clip applies post-update, treating the observed $S^h_{\max}$ as a guiding signal for next-step weight magnitude. The condition $\gamma_h \le 1$ guarantees that the rescaling is monotone-decreasing, so it acts as a one-sided projection onto the constraint set $\{\mathbf W : S^h_{\max} \le \tau\}$.
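A minimal sketch of the plain multi-head-attention case (the per-head weight layout is an assumption of this sketch):

```python
import numpy as np

def qk_clip(W_q, W_k, s_max, tau=100.0, alpha=0.5):
    """Post-update QK-Clip for regular multi-head attention (sketch).

    W_q, W_k: (heads, d_head, d_model) per-head projection weights
    s_max:    (heads,) max attention logits observed in the last forward pass
    """
    gamma = np.minimum(1.0, tau / s_max)             # gamma_h <= 1: only shrink
    W_q = W_q * gamma[:, None, None] ** alpha        # queries get gamma^alpha
    W_k = W_k * gamma[:, None, None] ** (1 - alpha)  # keys get gamma^(1-alpha)
    return W_q, W_k
```

Because the logit is bilinear in the query and key weights, the combined factor $\gamma_h^\alpha \cdot \gamma_h^{1-\alpha} = \gamma_h$ multiplies the offending head's next-step logits, pulling $S^h_{\max}$ back toward $\tau$ while leaving well-behaved heads untouched.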

The full optimizer is

$$\mathbf W_t = \text{QKClip}_\tau\left(\mathbf W_{t-1} - \eta\,(\mathbf O_t + \lambda \mathbf W_{t-1})\right)$$

with $\mathbf O_t$ from Muon. Over the entire 15.5T-token K2 training run, the max logit hits $\tau = 100$ in roughly 30% of steps in the first phase, then decays naturally below $\tau$ as the optimization stabilizes; QK-Clip stops being active for the remainder of training (paper Figure 2, right).

Pre-training data: rephrasing for token efficiency

K2's 15.5T tokens are enriched by a synthetic-rephrasing pipeline. The premise is that a single pass over knowledge-rich text provides incomplete absorption while many passes overfit; rephrasing the same text into many stylistic variants amplifies effective tokens without overfitting. Knowledge data is processed by:

  • Style- and perspective-diverse prompting (extending WRAP, Maini et al., 2024): an LLM is prompted to rephrase a passage in different writing styles and from different perspectives.
  • Chunk-wise autoregressive generation: long passages are split into roughly 256-token chunks; each chunk is rephrased autoregressively, conditioned on the previous chunk's rephrased output, preserving global coherence.
  • Fidelity verification: a separate model checks semantic equivalence of the rephrased chunk to the original.

Empirically (paper Table 1), at fixed compute, rephrasing once and training for 10 epochs scores 27.39 on SimpleQA versus 23.76 for 10 epochs on raw data; rephrasing 10 times and training for one epoch improves further to 28.94. The mathematics rephrasing follows a "learning-note" style adapted from SwallowMath (Fujii et al., 2024).
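The chunk-wise autoregressive step above can be sketched as follows. `rephrase_fn` is a hypothetical stand-in for an LLM call, and word-based chunking stands in for real tokenization:

```python
def chunkwise_rephrase(text, rephrase_fn, chunk_words=256):
    """Chunk-wise autoregressive rephrasing (sketch).

    `rephrase_fn(chunk, context)` stands in for an LLM call that rewrites
    `chunk` in a target style, conditioned on the previously rephrased
    chunk `context`, so style stays coherent across chunk boundaries.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    out, prev = [], ""
    for chunk in chunks:
        rephrased = rephrase_fn(chunk, context=prev)
        out.append(rephrased)
        prev = rephrased  # condition the next chunk on the rephrased output
    return " ".join(out)
```

Conditioning on the rephrased (not original) previous chunk is the key design choice: it keeps the chosen style consistent over passages far longer than a single generation window.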

RL algorithm: KL-regularized off-policy importance-style objective

K2 inherits the policy-optimization objective from K1.5. For each prompt $x$ from the dataset $\mathcal D$, $K$ responses $\{y_i\}_{i=1}^K$ are sampled from the previous policy $\pi_{\text{old}}$. The objective is

$$L_{\text{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal D}\left[\frac{1}{K}\sum_{i=1}^K\left(r(x, y_i) - \bar r(x) - \tau \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\right)^2\right]$$

where $\bar r(x) = \frac{1}{K}\sum_{i=1}^K r(x, y_i)$ is the per-prompt mean reward used as a control variate and $\tau > 0$ is a KL-regularization coefficient. The objective fits the scaled log-ratio $\tau \log(\pi_\theta / \pi_{\text{old}})$ to the advantage-style target $r(x, y_i) - \bar r(x)$ by squared error. Unlike PPO, it has no clipping; unlike DPO, it admits arbitrary scalar rewards rather than only pairwise preferences.
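The per-prompt term of the objective is a few lines of NumPy (a sketch; the variable names are mine):

```python
import numpy as np

def k15_rl_loss(rewards, logp_new, logp_old, tau=0.1):
    """Per-prompt K1.5-style squared-error objective (sketch).

    rewards:  (K,) scalar rewards for the K sampled responses
    logp_new: (K,) log pi_theta(y_i | x)
    logp_old: (K,) log pi_old(y_i | x)
    """
    adv = rewards - rewards.mean()              # mean reward as control variate
    resid = adv - tau * (logp_new - logp_old)   # target minus scaled log-ratio
    return np.mean(resid ** 2)
```

When the policy has not yet moved (logp_new equals logp_old), the loss reduces to the variance of the rewards across the K samples, and the gradient pushes the log-ratio of above-average responses up and below-average ones down.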

The optimizer for the RL stage is also Muon (with QK-Clip carried over); the paper claims this is the first published RL run at this scale with a Muon-family optimizer. Three additional regularizers stabilize the run:

  • Token-budget penalty. Each task xx is assigned a max-response budget B(x)B(x); responses exceeding B(x)B(x) are truncated and assigned a penalty. This counteracts the well-known "RL makes responses longer" effect on tasks where verbosity is not actually rewarded.
  • PTX auxiliary loss. A small fraction of pre-training-quality tokens are mixed into the RL gradient batch, weighted by an auxiliary PTX loss term, to prevent forgetting of the SFT distribution.
  • Temperature decay. Sampling temperature $T$ starts high (encouraging exploration on creative-writing and reasoning tasks) and decays linearly toward 1.0 over training, shifting from exploration to exploitation.
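Two of the three knobs reduce to tiny functions; a sketch (the start temperature and the penalty value are illustrative assumptions, not the paper's numbers):

```python
def temperature(step, total_steps, t_start=1.2, t_end=1.0):
    """Linear temperature decay from t_start to t_end over training."""
    frac = min(step / total_steps, 1.0)
    return t_start + (t_end - t_start) * frac

def budget_shaped_reward(raw_reward, n_tokens, budget, penalty=-0.5):
    """Token-budget shaping: responses over budget are truncated upstream
    and receive a fixed penalty instead of their task reward."""
    return penalty if n_tokens > budget else raw_reward
```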

Self-critique rubric reward

For non-verifiable tasks (creative writing, open-ended dialogue, faithfulness), K2 introduces a Self-Critique Rubric Reward. The model itself acts as a critic: given a prompt and KK candidate responses, it ranks them by performing pairwise comparisons against three rubric sets: core rubrics (the model's invariant identity values), prescriptive rubrics (anti-reward-hacking constraints), and human-annotated rubrics (instruction-specific). The pairwise comparisons are aggregated into a scalar reward via Bradley-Terry-style ranking.

The critic is bootstrapped from a curated open-source plus in-house preference dataset during SFT. During RL, the critic itself is closed-loop refined: on-policy rollouts from verifiable-reward prompts continually update the critic via supervised loss against the verifiable reward, transferring objective performance signal into the critic's evaluation. Over training, the critic and the policy improve together; the critic ratchets its evaluation bar in lockstep with policy quality. This is the K2-specific implementation of constitutional-AI-style self-supervised alignment, with the difference that the critic gets signal from verifiable tasks in real time rather than from a fixed constitution.

Agentic data synthesis

For tool-use post-training, K2's pipeline (paper Section 3.1, Figure 8) constructs roughly 23,000 tools by combining 3,000+ real Model Context Protocol (MCP) tools from GitHub with about 20,000 synthetic tools generated through a hierarchical domain ontology. From this tool repository it synthesizes thousands of distinct agent system prompts, each equipped with a different tool subset, then generates rubric-paired tasks per agent, then runs a multi-agent simulation (user agent, target agent, tool simulator, judge agent) to produce trajectories. A judge agent filters trajectories against the task rubric; only successful ones enter the SFT data. For coding and software engineering, simulated tool execution is replaced with real Kubernetes-backed sandboxes that run actual code and unit tests. The combined output both feeds the SFT distribution and provides verifiable-reward signal for the RL stage.
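The trajectory-synthesis loop has a simple control-flow skeleton. In this sketch every callable is a hypothetical stand-in for an LLM-backed component, and the turn limit is my own assumption:

```python
def simulate_and_filter(tasks, user_agent, target_agent, tool_sim, judge,
                        rubrics, max_turns=8):
    """Multi-agent trajectory synthesis with judge filtering (sketch).

    All callables stand in for the LLM-backed components described in the
    paper's pipeline; only judge-approved trajectories are kept for SFT.
    """
    accepted = []
    for task in tasks:
        trajectory = []
        msg = user_agent(task)                   # user agent opens the episode
        for _ in range(max_turns):
            reply, tool_calls = target_agent(msg, trajectory)
            results = [tool_sim(c) for c in tool_calls]  # simulated execution
            trajectory.append((msg, reply, tool_calls, results))
            msg = user_agent(task, trajectory)   # next user turn
            if msg is None:                      # user agent ends the episode
                break
        if judge(trajectory, rubrics[task]):     # rubric-based filter
            accepted.append(trajectory)
    return accepted
```

For coding tasks, `tool_sim` would be replaced by real sandboxed execution, per the paper's hybrid design.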

What Kimi K2 does not claim

It is not a chain-of-thought ("thinking") model in the o1/DeepSeek-R1 sense. The reported numbers are non-thinking baselines. The paper notes that thinking-mode extensions are an active area but not the focus.

It does not provide a sharp ablation isolating MuonClip from the rest of the architectural and data choices. The reported gains on standard benchmarks are joint effects of optimizer, sparsity, MLA, data rephrasing, and post-training. The optimizer stability claim ("zero loss spikes across 15.5T tokens") is the cleanest single-variable claim; the benchmark numbers are not.

It does not provide a learned-reward-model RL recipe (RLHF in the Christiano-2017 sense). The reward signal is verifiable for math/code/instruction-following, and self-critique-rubric for everything else. There is no separately trained reward model.


Why It Matters Now

K2 is the first open-weight release at trillion-parameter MoE scale to use a non-AdamW optimizer end-to-end and provide a complete recipe for stabilizing it. AdamW dominates the open frontier through inertia rather than evidence; smaller-scale Muon studies have shown 20–40% token-efficiency gains, but transferring those gains to trillion-parameter MoE training has been blocked on training-instability questions. MuonClip is the first published answer that holds across the full scale. The next round of open-weight pretraining runs will plausibly default to MuonClip-style optimizers; the AdamW monopoly is over.

The agentic post-training recipe is also worth flagging. The dominant open-weight pattern through 2024–2025 was: pretrain on web text, SFT on cleaned instruction data, RLHF or DPO on preference data. Tool use was added downstream as fine-tuning. K2 inverts this: synthetic agent trajectories with real-execution sandbox grounding are a first-class part of the post-training distribution, not a downstream adapter. The SWE-bench Verified score (65.8) suggests this pays off in production-grade software-engineering tasks where tool use is the bottleneck. Expect every serious open-weight LLM after K2 to have a similar pipeline.

The architectural lesson on sparsity is more specific. K2's 1.69× FLOP reduction at sparsity-48 is consistent with smaller-scale MoE scaling laws, but pushing to sparsity-48 at trillion-parameter scale has serious infrastructure cost: the all-to-all communication for routing 8 active experts out of 384 across nodes is intricate to make efficient. Moonshot has demonstrated this is doable; the open question is whether sparsity-96 or higher can be made to work, and whether the FLOP-loss curve stays monotonic at those points or breaks.

The reduction in attention heads is a second architectural data point against the conventional wisdom (DeepSeek-V3's heuristic of heads = 2 × layers). In the long-context regime where agentic workloads live, KV-cache size and attention FLOPs scale linearly in head count; the larger head count had been chosen for short-context bandwidth utilization, which is the wrong objective once context exceeds 128k tokens. Halving the heads is a small architectural change with a large inference-cost dividend.

References

Canonical:

  • Kimi Team. "Kimi K2: Open Agentic Intelligence." arXiv preprint (technical report). The paper under discussion.

Direct precursors (Muon and MoE):

  • Jordan, K. et al. (2024). "Muon: An optimizer for hidden layers in neural networks." Blog post / arXiv. The base optimizer K2 builds on.
  • Liu, A. et al. (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Source of MLA, the architecture template, and many of the MoE design choices.
  • DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. Original MLA paper.
  • Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arXiv:1701.06538. The MoE template.

Pre-training data and rephrasing:

  • Maini, P. et al. (2024). "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (WRAP)." ICLR 2024. arXiv:2401.16380. The rephrasing-for-token-efficiency idea K2 extends.
  • Kimi Team. (2025). "Kimi k1.5: Scaling Reinforcement Learning with LLMs." arXiv:2501.12599. The K1.5 RL algorithm K2 inherits.

RL and alignment lineage:

  • Christiano, P. F. et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741. Original RLHF.
  • Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. The self-critique-rubric lineage.
  • Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. PPO; K2's KL-regularized squared-error objective is in the same family.

Attention-logit instability fixes (alternatives to QK-Clip):

  • Gemma Team et al. (2024). "Gemma 2: Improving Open Language Models at a Practical Size." arXiv:2408.00118. Logit soft-cap.
  • Henry, A. et al. (2020). "Query-Key Normalization for Transformers." Findings of EMNLP 2020. arXiv:2010.04245. QK-Norm; not applicable to MLA.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12 — transformers and attention; Chapter 16 — pretraining and post-training.


Last reviewed: May 6, 2026