
Paper breakdown

Kimi K2: Open Agentic Intelligence

Kimi Team · 2025 · arXiv preprint (technical report)

1.04 trillion total / 32 billion activated parameter Mixture-of-Experts LLM trained on 15.5T tokens with the MuonClip optimizer (Muon plus per-head QK-Clip post-update weight rescaling). Pairs ultra-sparse MoE with Multi-head Latent Attention (MLA, from DeepSeek-V2/V3) and a synthetic agentic data pipeline. Top open-weight model on Tau2-bench, ACEBench, and SWE-bench Verified at release.

Overview

Moonshot AI's Kimi K2 (2025) is a 1.04 trillion-parameter Mixture-of-Experts (MoE) language model with 32 billion activated parameters, pre-trained on 15.5 trillion tokens with no observed loss spike across the entire run. The technical report's three engineering contributions are:

  1. MuonClip, an optimizer that combines the Muon update (Newton-Schulz-orthogonalized gradient with RMS rescaling) with a post-update weight clip on attention query and key projections (QK-Clip) to prevent the attention-logit growth that destabilizes Muon at trillion-parameter scale.
  2. A synthetic agentic data pipeline that produces roughly 20,000 tools, thousands of agents, multi-turn rubric-evaluated trajectories, and a hybrid simulated-plus-real-execution sandbox, used both for SFT and as the source of verifiable rewards.
  3. A self-critique rubric reward that extends RL-from-verifiable-reward (RLVR) to subjective tasks by having the model rank its own outputs against curated rubrics in a closed loop with the policy.

Architecturally, K2 follows the DeepSeek-V3 template but pushes sparsity further: 384 experts (versus 256 in DeepSeek-V3) with the same 8 active per token, half the attention heads (64 versus 128), and only 1 dense layer. The model uses Multi-head Latent Attention (MLA) for KV-cache compression. The result is the top open-source non-thinking model on Tau2-bench (66.1), ACEBench-En (76.5), and SWE-bench Verified (65.8) at release, and the fifth-overall entry on the LMSYS Arena leaderboard across more than 3,000 user votes.

The paper is a technical report, not a methodologically novel contribution. Its weight is in the demonstrated stability of the training run at trillion-parameter MoE scale and in the public release of base and post-trained checkpoints. Most prior open-weight models at this scale (DeepSeek-V3, Llama 4-class) have been AdamW-trained; Kimi K2 is the first open release showing that Muon-family optimization can be made stable at this scale with a small post-hoc fix.

Mathematical Contributions

Architecture: ultra-sparse MoE with MLA

K2 has 61 transformer layers, 1 dense and 60 MoE. Each MoE layer has 384 experts plus 1 shared expert; the gating function selects 8 experts per token, giving sparsity ratio

$$s = \frac{\text{total experts}}{\text{active experts}} = \frac{384}{8} = 48.$$

For a fixed activated-parameter budget, the paper's sparsity scaling law (their Section 2.3 and Figure 5) shows validation loss dropping monotonically as sparsity grows: at a validation loss of 1.5, sparsity-48 uses 1.69× fewer FLOPs than sparsity-8. The cost is more complex routing infrastructure and a bigger memory footprint, both worth it at this scale.
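To make the routing arithmetic concrete, here is a minimal top-k gating sketch in NumPy. The function names, the toy dimensions, and the softmax-over-selected-logits convention are illustrative assumptions, not K2's exact router:

```python
import numpy as np

def topk_route(h, W_gate, k=8):
    """Top-k expert routing (illustrative sketch, not K2's exact router).

    h:      (tokens, d_model) hidden states
    W_gate: (d_model, n_experts) router weights
    Returns expert indices and normalized gate weights, (tokens, k) each.
    """
    logits = h @ W_gate                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k largest logits
    sel = np.take_along_axis(logits, topk, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))  # softmax over selected
    w /= w.sum(axis=-1, keepdims=True)
    return topk, w

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 16))         # 4 tokens, toy d_model = 16
W_gate = rng.standard_normal((16, 384))  # 384 experts, as in K2
idx, w = topk_route(h, W_gate)           # each token routes to 8 of 384 experts
```

Only the 8 selected experts run their FFN for a given token, which is where the sparsity ratio of 48 comes from.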

The attention is Multi-head Latent Attention (MLA, from DeepSeek-V2/V3), which factorizes the key and value projections through a low-rank latent: $K = W^{KU} c^{KV}$, where $c^{KV} = W^{KD} h$ is a roughly 512-dimensional latent vector cached per token instead of the full keys and values. This gives roughly 13× KV-cache compression versus full multi-head attention at the same head dimension. K2 uses 64 attention heads (DeepSeek-V3 used 128); the paper's Figure 6 ablation shows that doubling heads from 64 to 128 buys only a 0.5–1.2% validation-loss improvement at the cost of an 83% increase in inference FLOPs at 128k-token context. The head count is reduced for inference efficiency, especially in long-context agentic workloads.

The Muon update

Muon (Jordan et al., 2024) replaces AdamW's diagonal preconditioner with a Newton-Schulz orthogonalization of the momentum-smoothed gradient. For each weight matrix $\mathbf W \in \mathbb{R}^{n\times m}$:

$$\mathbf M_t = \mu \mathbf M_{t-1} + \mathbf G_t$$

$$\mathbf O_t = 0.2\,\sqrt{\max(n, m)}\;\text{NewtonSchulz}(\mathbf M_t)$$

$$\mathbf W_t = \mathbf W_{t-1} - \eta\,(\mathbf O_t + \lambda \mathbf W_{t-1}).$$

The orthogonalization $\text{NewtonSchulz}(\mathbf M_t) \approx (\mathbf M_t \mathbf M_t^\top)^{-1/2} \mathbf M_t$ projects the update onto the Stiefel manifold of (semi-)orthogonal matrices, and the $0.2\,\sqrt{\max(n,m)}$ factor is chosen so that $\|\mathbf O_t\|_{\text{RMS}}$ matches what AdamW would produce at the same point. The Newton-Schulz iteration is a polynomial in $\mathbf M_t \mathbf M_t^\top$ that converges in about 5 steps to the orthogonalized matrix without an SVD. Empirically, at constant compute and data, Muon substantially outperforms AdamW on token efficiency for standard transformer training.
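A minimal NumPy sketch of the update follows. The quintic coefficients are taken from Jordan et al.'s public reference implementation, and the Frobenius-norm normalization and hyperparameter values are assumptions of this sketch:

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Approximate (M M^T)^(-1/2) M with a quintic Newton-Schulz iteration.

    Coefficients follow Jordan et al.'s public reference implementation
    (an assumption of this sketch); no SVD is needed.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (np.linalg.norm(M) + 1e-7)   # normalize so singular values <= 1
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

def muon_step(W, G, M, mu=0.95, lr=0.02, wd=0.01):
    """One Muon update: momentum, orthogonalize, RMS-rescale, decayed step."""
    M = mu * M + G
    O = 0.2 * np.sqrt(max(W.shape)) * newton_schulz(M)
    W = W - lr * (O + wd * W)
    return W, M
```

The iteration drives all singular values of the update toward 1, so every direction in the gradient's row/column space gets a comparable step size regardless of its raw magnitude.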

Why scaling Muon is hard: attention-logit explosion

The paper documents an instability that appears more frequently with Muon than with AdamW: per-head max attention logits grow without bound during training. In a mid-scale experiment (9B activated, 53B total) with vanilla Muon, the per-head max logit

$$S^h_{\max} = \frac{1}{\sqrt{d}}\,\max_{\mathbf X \in B}\,\max_{i,j}\,\mathbf q_i^h (\mathbf k_j^h)^\top$$

crossed 1,000 within roughly 1,000 steps and triggered loss spikes (paper Figure 2, left). Logit soft-cap (Gemma's fix) caps the logit directly but allows the underlying $\mathbf Q \mathbf K^\top$ products to keep growing; Query-Key Normalization (QK-Norm) normalizes the keys but is incompatible with MLA, because the full key matrix is never materialized.
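The monitoring statistic is cheap to compute from pre-softmax logits. A NumPy sketch (the array layout is an assumption of this sketch):

```python
import numpy as np

def max_attention_logit(Q, K, d):
    """Per-head max pre-softmax logit S^h_max over a batch B.

    Q, K: (batch, heads, seq, d) query/key activations for one layer.
    Returns a (heads,) vector: the statistic QK-Clip compares against tau.
    """
    logits = np.einsum("bhid,bhjd->bhij", Q, K) / np.sqrt(d)
    return logits.max(axis=(0, 2, 3))

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4, 8, 16))   # batch=2, heads=4, seq=8, d=16
s_max = max_attention_logit(Q, Q, d=16)  # (4,) per-head maxima
```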

QK-Clip: post-update weight rescaling

The contribution is QK-Clip: after each optimizer step, for any head $h$ whose forward-pass max logit exceeded a threshold $\tau$ (the paper uses $\tau = 100$), rescale the query and key projection weights to bring it back below $\tau$ on the next forward pass.

Per-head scale factor:

$$\gamma_h = \min(1, \tau / S^h_{\max}).$$

For a regular multi-head attention layer:

$$\mathbf W^h_q \leftarrow \gamma_h^\alpha\,\mathbf W^h_q, \qquad \mathbf W^h_k \leftarrow \gamma_h^{1-\alpha}\,\mathbf W^h_k$$

with $\alpha = 0.5$ by default (split the rescaling symmetrically between $\mathbf Q$ and $\mathbf K$). For MLA, the head-shared rotary key is left untouched to avoid coupling heads; the head-specific components $\mathbf q^C, \mathbf k^C$ are each scaled by $\sqrt{\gamma_h}$, and the head-specific rotary $\mathbf q^R$ is scaled by $\gamma_h$.

Crucially, QK-Clip does not alter the forward pass that already happened: the gradient on this step is computed against the unclipped weights. The clip applies post-update, treating the observed $S^h_{\max}$ as a guiding signal for next-step weight magnitude. The condition $\gamma_h \le 1$ guarantees that the rescaling is monotone-decreasing, so it acts as a one-sided projection onto the constraint set $\{\mathbf W : S^h_{\max} \le \tau\}$.
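A minimal sketch of the plain multi-head-attention case (the per-head weight layout is an assumption of this sketch):

```python
import numpy as np

def qk_clip(W_q, W_k, s_max, tau=100.0, alpha=0.5):
    """Post-update QK-Clip for regular multi-head attention (sketch).

    W_q, W_k: (heads, d_head, d_model) per-head projection weights
    s_max:    (heads,) max attention logits observed in the last forward pass
    """
    gamma = np.minimum(1.0, tau / s_max)             # gamma_h <= 1: only shrink
    W_q = W_q * gamma[:, None, None] ** alpha        # queries get gamma^alpha
    W_k = W_k * gamma[:, None, None] ** (1 - alpha)  # keys get gamma^(1-alpha)
    return W_q, W_k
```

Because the logit is bilinear in the query and key weights, the combined factor $\gamma_h^\alpha \cdot \gamma_h^{1-\alpha} = \gamma_h$ multiplies the offending head's next-step logits, pulling $S^h_{\max}$ back toward $\tau$ while leaving well-behaved heads untouched.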

The full optimizer is

$$\mathbf W_t = \text{QKClip}_\tau\left(\mathbf W_{t-1} - \eta\,(\mathbf O_t + \lambda \mathbf W_{t-1})\right)$$

with $\mathbf O_t$ from Muon. Over the entire 15.5T-token K2 training run, the max logit hits $\tau = 100$ in roughly 30% of steps in the first phase, then decays naturally below $\tau$ as the optimization stabilizes; QK-Clip stops being active for the remainder of training (paper Figure 2, right).

Pre-training data: rephrasing for token efficiency

K2's 15.5T tokens are enriched by a synthetic-rephrasing pipeline. The premise is that a single pass over knowledge-rich text provides incomplete absorption while many passes overfit; rephrasing the same text into many stylistic variants amplifies effective tokens without overfitting. Knowledge data is processed by:

  • Style- and perspective-diverse prompting (extending WRAP, Maini et al., 2024): an LLM is prompted to rephrase a passage in different writing styles and from different perspectives.
  • Chunk-wise autoregressive generation: long passages are split into roughly 256-token chunks; each chunk is rephrased autoregressively, conditioned on the previous chunk's rephrased output, preserving global coherence.
  • Fidelity verification: a separate model checks semantic equivalence of the rephrased chunk to the original.

Empirically (paper Table 1), at fixed compute, rephrasing once and training for 10 epochs scores 27.39 on SimpleQA versus 23.76 for 10 epochs on raw data; rephrasing 10 times and training for one epoch improves further to 28.94. The mathematics rephrasing follows a "learning-note" style adapted from SwallowMath (Fujii et al., 2024).
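The chunk-wise autoregressive step above can be sketched as follows. `rephrase_fn` is a hypothetical stand-in for an LLM call, and word-based chunking stands in for real tokenization:

```python
def chunkwise_rephrase(text, rephrase_fn, chunk_words=256):
    """Chunk-wise autoregressive rephrasing (sketch).

    `rephrase_fn(chunk, context)` stands in for an LLM call that rewrites
    `chunk` in a target style, conditioned on the previously rephrased
    chunk `context`, so style stays coherent across chunk boundaries.
    """
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    out, prev = [], ""
    for chunk in chunks:
        rephrased = rephrase_fn(chunk, context=prev)
        out.append(rephrased)
        prev = rephrased  # condition the next chunk on the rephrased output
    return " ".join(out)
```

Conditioning on the rephrased (not original) previous chunk is the key design choice: it keeps the chosen style consistent over passages far longer than a single generation window.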

RL algorithm: KL-regularized off-policy importance-style objective

K2 inherits the policy-optimization objective from K1.5. For each prompt $x$ from the dataset $\mathcal D$, $K$ responses $\{y_i\}_{i=1}^K$ are sampled from the previous policy $\pi_{\text{old}}$. The objective is

$$L_{\text{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal D}\left[\frac{1}{K}\sum_{i=1}^K\left(r(x, y_i) - \bar r(x) - \tau \log \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)}\right)^2\right]$$

where $\bar r(x) = \frac{1}{K}\sum_{i=1}^K r(x, y_i)$ is the per-prompt mean reward used as a control variate and $\tau > 0$ is a KL-regularization coefficient. The objective fits the scaled log-ratio $\tau \log(\pi_\theta / \pi_{\text{old}})$ to the advantage-style target $r(x, y_i) - \bar r(x)$ by squared error. Unlike PPO, it has no clipping; unlike DPO, it admits arbitrary scalar rewards rather than only pairwise preferences.
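The per-prompt term of the objective is a few lines of NumPy (a sketch; the variable names are mine):

```python
import numpy as np

def k15_rl_loss(rewards, logp_new, logp_old, tau=0.1):
    """Per-prompt K1.5-style squared-error objective (sketch).

    rewards:  (K,) scalar rewards for the K sampled responses
    logp_new: (K,) log pi_theta(y_i | x)
    logp_old: (K,) log pi_old(y_i | x)
    """
    adv = rewards - rewards.mean()              # mean reward as control variate
    resid = adv - tau * (logp_new - logp_old)   # target minus scaled log-ratio
    return np.mean(resid ** 2)
```

When the policy has not yet moved (logp_new equals logp_old), the loss reduces to the variance of the rewards across the K samples, and the gradient pushes the log-ratio of above-average responses up and below-average ones down.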

The optimizer for the RL stage is also Muon (with QK-Clip carried over); the paper claims this is the first published RL run at this scale with a Muon-family optimizer. Three additional regularizers stabilize the run:

  • Token-budget penalty. Each task xx is assigned a max-response budget B(x)B(x); responses exceeding B(x)B(x) are truncated and assigned a penalty. This counteracts the well-known "RL makes responses longer" effect on tasks where verbosity is not actually rewarded.
  • PTX auxiliary loss. A small fraction of pre-training-quality tokens are mixed into the RL gradient batch, weighted by an auxiliary PTX loss term, to prevent forgetting of the SFT distribution.
  • Temperature decay. Sampling temperature $T$ starts high (encouraging exploration on creative-writing and reasoning tasks) and decays linearly toward 1.0 over training, shifting from exploration to exploitation.
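Two of the three knobs reduce to tiny functions; a sketch (the start temperature and the penalty value are illustrative assumptions, not the paper's numbers):

```python
def temperature(step, total_steps, t_start=1.2, t_end=1.0):
    """Linear temperature decay from t_start to t_end over training."""
    frac = min(step / total_steps, 1.0)
    return t_start + (t_end - t_start) * frac

def budget_shaped_reward(raw_reward, n_tokens, budget, penalty=-0.5):
    """Token-budget shaping: responses over budget are truncated upstream
    and receive a fixed penalty instead of their task reward."""
    return penalty if n_tokens > budget else raw_reward
```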

Self-critique rubric reward

For non-verifiable tasks (creative writing, open-ended dialogue, faithfulness), K2 introduces a Self-Critique Rubric Reward. The model itself acts as a critic: given a prompt and KK candidate responses, it ranks them by performing pairwise comparisons against three rubric sets: core rubrics (the model's invariant identity values), prescriptive rubrics (anti-reward-hacking constraints), and human-annotated rubrics (instruction-specific). The pairwise comparisons are aggregated into a scalar reward via Bradley-Terry-style ranking.

The critic is bootstrapped from a curated open-source plus in-house preference dataset during SFT. During RL, the critic itself is closed-loop refined: on-policy rollouts from verifiable-reward prompts continually update the critic via supervised loss against the verifiable reward, transferring objective performance signal into the critic's evaluation. Over training, the critic and the policy improve together; the critic ratchets its evaluation bar in lockstep with policy quality. This is the K2-specific implementation of constitutional-AI-style self-supervised alignment, with the difference that the critic gets signal from verifiable tasks in real time rather than from a fixed constitution.

Agentic data synthesis

For tool-use post-training, K2's pipeline (paper Section 3.1, Figure 8) constructs roughly 23,000 tools by combining 3,000+ real Model Context Protocol (MCP) tools from GitHub with about 20,000 synthetic tools generated through a hierarchical domain ontology. From this tool repository it synthesizes thousands of distinct agent system prompts, each equipped with a different tool subset, then generates rubric-paired tasks per agent, then runs a multi-agent simulation (user agent, target agent, tool simulator, judge agent) to produce trajectories. A judge agent filters trajectories against the task rubric; only successful ones enter the SFT data. For coding and software engineering, simulated tool execution is replaced with real Kubernetes-backed sandboxes that run actual code and unit tests. The combined output both feeds the SFT distribution and provides verifiable-reward signal for the RL stage.
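The trajectory-synthesis loop has a simple control-flow skeleton. In this sketch every callable is a hypothetical stand-in for an LLM-backed component, and the turn limit is my own assumption:

```python
def simulate_and_filter(tasks, user_agent, target_agent, tool_sim, judge,
                        rubrics, max_turns=8):
    """Multi-agent trajectory synthesis with judge filtering (sketch).

    All callables stand in for the LLM-backed components described in the
    paper's pipeline; only judge-approved trajectories are kept for SFT.
    """
    accepted = []
    for task in tasks:
        trajectory = []
        msg = user_agent(task)                   # user agent opens the episode
        for _ in range(max_turns):
            reply, tool_calls = target_agent(msg, trajectory)
            results = [tool_sim(c) for c in tool_calls]  # simulated execution
            trajectory.append((msg, reply, tool_calls, results))
            msg = user_agent(task, trajectory)   # next user turn
            if msg is None:                      # user agent ends the episode
                break
        if judge(trajectory, rubrics[task]):     # rubric-based filter
            accepted.append(trajectory)
    return accepted
```

For coding tasks, `tool_sim` would be replaced by real sandboxed execution, per the paper's hybrid design.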

What Kimi K2 does not claim

It is not a chain-of-thought ("thinking") model in the o1/DeepSeek-R1 sense. The reported numbers are non-thinking baselines. The paper notes that thinking-mode extensions are an active area but not the focus.

It does not provide a sharp ablation isolating MuonClip from the rest of the architectural and data choices. The reported gains on standard benchmarks are joint effects of optimizer, sparsity, MLA, data rephrasing, and post-training. The optimizer stability claim ("zero loss spikes across 15.5T tokens") is the cleanest single-variable claim; the benchmark numbers are not.

It does not provide a learned-reward-model RL recipe (RLHF in the Christiano-2017 sense). The reward signal is verifiable for math/code/instruction-following, and self-critique-rubric for everything else. There is no separately trained reward model.


Why It Matters Now

K2 is the first open-weight release at trillion-parameter MoE scale to use a non-AdamW optimizer end-to-end and provide a complete recipe for stabilizing it. AdamW dominates the open frontier through inertia rather than evidence; smaller-scale Muon studies have shown 20–40% token-efficiency gains, but transferring those gains to trillion-parameter MoE training has been blocked on training-instability questions. MuonClip is the first published answer that holds across the full scale. The next round of open-weight pretraining runs will plausibly default to MuonClip-style optimizers; the AdamW monopoly is over.

The agentic post-training recipe is also worth flagging. The dominant open-weight pattern through 2024–2025 was: pretrain on web text, SFT on cleaned instruction data, RLHF or DPO on preference data. Tool use was added downstream as fine-tuning. K2 inverts this: synthetic agent trajectories with real-execution sandbox grounding are a first-class part of the post-training distribution, not a downstream adapter. The SWE-bench Verified score (65.8) suggests this pays off in production-grade software-engineering tasks where tool use is the bottleneck. Expect every serious open-weight LLM after K2 to have a similar pipeline.

The architectural lesson on sparsity is more specific. K2's 1.69× FLOP reduction at sparsity-48 is consistent with smaller-scale MoE scaling laws, but pushing to sparsity-48 at trillion-parameter scale has serious infrastructure cost: the all-to-all communication for routing 8 active experts out of 384 across nodes is intricate to make efficient. Moonshot has demonstrated this is doable; the open question is whether sparsity-96 or higher can be made to work, and whether the FLOP-loss curve stays monotonic at those points or breaks.

The reduction in attention heads is a second architectural data point against the conventional wisdom (DeepSeek-V3's heuristic of heads = 2 × layers). In the long-context regime where agentic workloads live, KV-cache size and attention FLOPs scale linearly in head count; the larger head count had been chosen for short-context bandwidth utilization, which is the wrong objective once context exceeds 128k tokens. Halving the heads is a small architectural change with a large inference-cost dividend.

References

Canonical:

  • Kimi Team. "Kimi K2: Open Agentic Intelligence." arXiv preprint (technical report). The paper under discussion.

Direct precursors (Muon and MoE):

  • Jordan, K. et al. (2024). "Muon: An optimizer for hidden layers in neural networks." Blog post / arXiv. The base optimizer K2 builds on.
  • Liu, A. et al. (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. Source of MLA, the architecture template, and many of the MoE design choices.
  • DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. Original MLA paper.
  • Shazeer, N. et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arXiv:1701.06538. The MoE template.

Pre-training data and rephrasing:

  • Maini, P. et al. (2024). "Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (WRAP)." ICLR 2024. arXiv:2401.16380. The rephrasing-for-token-efficiency idea K2 extends.
  • Kimi Team. (2025). "Kimi k1.5: Scaling Reinforcement Learning with LLMs." arXiv:2501.12599. The K1.5 RL algorithm K2 inherits.

RL and alignment lineage:

  • Christiano, P. F. et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017. arXiv:1706.03741. Original RLHF.
  • Bai, Y. et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073. The self-critique-rubric lineage.
  • Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347. PPO; K2's KL-regularized squared-error objective is in the same family.

Attention-logit instability fixes (alternatives to QK-Clip):

  • Gemma Team et al. (2024). "Gemma 2: Improving Open Language Models at a Practical Size." arXiv:2408.00118. Logit soft-cap.
  • Henry, A. et al. (2020). "Query-Key Normalization for Transformers." Findings of EMNLP 2020. arXiv:2010.04245. QK-Norm; not applicable to MLA.

Standard textbook:

  • Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapter 12 — transformers and attention; Chapter 16 — pretraining and post-training.


Last reviewed: May 6, 2026