
LLM Construction

Reasoning Data Curation

How to build training data for reasoning models: math with verified solutions, code with test cases, rejection sampling, external verification, self-play for problem generation, and the connection to RLVR.

Advanced · Tier 2 · Frontier watch · ~45 min

Why This Matters

The models that reason well (o1, DeepSeek-R1, Claude extended-thinking models) were not trained on generic internet text alone. Their post-training mixes include curated reasoning traces, and in some publicly documented cases (notably DeepSeek-R1, 2025) the pipeline explicitly filters for verified-correct solutions. OpenAI and Anthropic system cards describe broader data mixes rather than a verified-only pipeline, so the strong claim should not be projected onto the entire class. The quality of reasoning training data still determines the ceiling for reasoning capability. Scaling incorrect or unverified reasoning data makes models more fluent at being wrong, not better at being right.

Mental Model

The core principle: for reasoning tasks, you can verify correctness even when you cannot generate correct solutions. A math proof checker can verify a proof without being able to produce one. A test suite can verify code without writing it. This asymmetry between generation and verification is what makes reasoning data curation possible at scale.

The pipeline: (1) source or generate problems, (2) generate many candidate solutions, (3) verify solutions with external tools, (4) keep only verified correct solutions, (5) train on the verified data.

Curate reasoning data by turning generation into a verified funnel

The model does not need to be reliable on every attempt. It only needs to produce enough candidates that an external verifier can separate good reasoning traces from fluent garbage.

  1. Generate problems: benchmarks, self-play, synthetic curricula
  2. Sample many solutions: best-of-N, temperature sweeps, retries
  3. External verification: tests, proof checkers, symbolic graders
  4. Keep only verified traces: clean supervised data or RLVR reward

Low-value traces fall out of the funnel: wrong final answer, failed tests, invalid proof, numerical mismatch. High-value traces survive it: passed tests, proof accepted, symbolic check cleared, answer matches ground truth.
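The funnel stages above can be sketched as a plain filtering loop. The `generate` and `verify` callables below are hypothetical stand-ins for a real sampler and a real verifier (test runner, proof checker, symbolic grader):

```python
import random

def curate(problems, generate, verify, samples_per_problem=16):
    """Verified funnel: sample many candidates per problem, keep only
    the traces an external verifier accepts."""
    kept = []
    for problem in problems:
        candidates = [generate(problem) for _ in range(samples_per_problem)]
        kept.extend((problem, sol) for sol in candidates if verify(problem, sol))
    return kept

# Toy demo: a "solution" is a guess at a hidden integer; the verifier
# checks it against ground truth.
random.seed(0)
problems = list(range(5))
generate = lambda p: random.randint(0, 9)   # noisy generator, ~10% pass rate
verify = lambda p, s: s == p                # exact-match ground-truth check
data = curate(problems, generate, verify)
assert all(sol == prob for prob, sol in data)  # every kept trace is verified
```

The point of the sketch: nothing in the loop requires the generator to be reliable, only that the verifier is trustworthy enough that the kept set is clean.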

Why verification changes the game

Reasoning curation works because correctness can be checked more cheaply than it can be generated. A model may fail often, but best-of-N plus a trustworthy verifier can still yield excellent training data.

What still goes wrong

Final-answer filtering can keep logically bad chains that accidentally land on the right answer. Good pipelines mix outcome verification with stronger process checks, hidden tests, or formal proof tools where possible.

Production reality

The expensive step is not storing the kept traces. The expensive step is generating enough candidates and running enough external checks that the retained set is genuinely clean.

Types of Reasoning Data

Definition

Verifiable Reasoning Data

Verifiable reasoning data consists of (problem, solution) pairs where the solution's correctness can be checked by an external system:

  1. Math with ground truth: competition problems with known numerical answers, theorem proving with formal proof assistants (Lean, Coq, Isabelle)
  2. Code with test cases: programming problems where solutions are checked by running test suites, including hidden test cases that the model never sees during training
  3. Science with known answers: physics, chemistry, and biology problems where the answer can be derived from first principles or looked up
  4. Logic puzzles: constraint satisfaction problems where a solution can be verified by checking all constraints

The key property: verification is cheap and reliable, even when generation is hard.

Rejection Sampling

Proposition

Rejection Sampling Improves Solution Quality

Statement

If a model generates $N$ independent solutions to a problem and each solution is correct with probability $p$, then the probability that at least one correct solution exists among the $N$ samples is:

$$P(\text{at least one correct}) = 1 - (1-p)^N$$

The expected number of correct solutions is $Np$. If we select uniformly at random from the correct solutions, the resulting training example is guaranteed correct (given a perfect verifier). To achieve $P(\text{at least one correct}) \geq 1 - \delta$:

$$N \geq \frac{\ln(1/\delta)}{\ln(1/(1-p))} \approx \frac{\ln(1/\delta)}{p} \quad \text{for small } p$$

Intuition

Generate many attempts, keep the ones that work. If the model solves a problem 10% of the time, generating 50 attempts gives a 99.5% chance of at least one correct solution. The verified correct solutions become high-quality training data. The model is effectively distilling its own best-case behavior into reliable behavior.

Proof Sketch

Each sample is correct independently with probability $p$. The probability that all $N$ fail is $(1-p)^N$, so $P(\text{any correct}) = 1 - (1-p)^N$. Setting this $\geq 1 - \delta$ and solving: $(1-p)^N \leq \delta$, so $N \geq \ln\delta / \ln(1-p)$. Using $\ln(1-p) \approx -p$ for small $p$ gives the approximation.
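A quick numeric check of the bound, using only the formula above (the exact bound rounds up to 51 samples for a strict 99.5% guarantee, consistent with the roughly 50 quoted in the intuition):

```python
import math

def samples_needed(p, delta):
    """Smallest N with 1 - (1 - p)**N >= 1 - delta, i.e. (1 - p)**N <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - p))

# Pass rate 10%, target at least a 99.5% chance of one correct solution:
N = samples_needed(p=0.10, delta=0.005)        # exact bound
approx = math.log(1 / 0.005) / 0.10            # ln(1/delta)/p approximation
coverage = 1 - (1 - 0.10) ** N                 # achieved probability
```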

Why It Matters

Rejection sampling is a primary method for generating reasoning training data at scale. DeepSeek-R1 (2025) documents it explicitly as part of its pipeline. Other reasoning model releases describe curated reasoning data without specifying the exact filtering procedure. The method works even when the base model's pass rate is low (say 1-5%), as long as you can afford to generate enough samples and verify them.

Failure Mode

If the verifier has false positives (accepts incorrect solutions), the training data is contaminated. If the base model's pass rate $p$ is extremely low (below 0.1%), the cost of generating enough samples becomes prohibitive. Also, rejection sampling only selects for final-answer correctness; it does not guarantee that the reasoning chain itself is valid.

Verification Methods

Definition

External Verification

External verification uses tools outside the model to check solution correctness:

  1. Code execution: run the generated code against test cases (both visible and hidden). Check for correctness, not just compilation
  2. Formal proof assistants: Lean 4, Coq, Isabelle can type-check proofs for mathematical theorems. If the proof compiles, it is correct by construction
  3. Symbolic math checkers: computer algebra systems (SymPy, Mathematica) can verify numerical answers and simplify expressions
  4. Unit tests for reasoning: for word problems, verify the final answer against known ground truth

External verifiers are the gold standard because they do not rely on the model's own judgment, which is unreliable for hard problems.
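A minimal sketch of outcome verification for code, assuming candidates arrive as callables and test cases as (args, expected) pairs; real pipelines sandbox execution instead of calling directly:

```python
def passes_tests(candidate_fn, test_cases):
    """Outcome verification for code: run the candidate against every
    (args, expected) pair; crashes count as failures, not errors."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return False
        except Exception:
            return False
    return True

# Hidden tests catch a solution that only memorized the visible cases.
visible = [((2,), 4), ((3,), 9)]
hidden = [((10,), 100)]
good = lambda x: x * x
memorized = lambda x: {2: 4, 3: 9}[x]   # fails outside the visible cases
```

The hidden set is what makes the check an actual verifier rather than a pattern the generator can overfit.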

Process Reward Models (PRMs) vs Outcome Reward Models (ORMs)

When external verification is unavailable, learned verifiers substitute:

  • ORMs score the final answer only. Binary signal: correct or incorrect
  • PRMs score each step in the reasoning chain. Richer signal but requires step-level correctness labels, which are expensive to collect

PRMs produce better training signal because they can identify where reasoning went wrong, enabling the model to learn which steps to avoid. The cost is that PRM training data requires human annotation of individual reasoning steps.
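The ORM/PRM contrast can be made concrete. This sketch assumes per-step scores are available and uses min-aggregation (one common choice; products over steps are another):

```python
def orm_score(final_correct):
    """Outcome reward: binary signal on the final answer only."""
    return 1.0 if final_correct else 0.0

def prm_score(step_scores):
    """Process reward over per-step scores; min-aggregation lets the
    weakest step dominate the chain's score."""
    return min(step_scores)

# Two chains that both reach the correct answer: an ORM cannot tell them
# apart, while a PRM flags the shaky middle step.
clean = [0.95, 0.90, 0.92]
shaky = [0.95, 0.20, 0.92]   # e.g. compensating errors that still land right
```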

Self-Play for Problem Generation

Definition

Self-Play Problem Generation

Self-play generates new training problems using the model itself:

  1. The model generates a problem (e.g., a math question)
  2. The model generates a solution to the problem
  3. An external verifier checks the solution
  4. If verified, the (problem, solution) pair becomes training data
  5. If the model cannot solve its own problem (low pass rate), the problem is "hard" and particularly valuable for training

This creates a curriculum: the model generates problems at the frontier of its own capability. Problems it solves easily are not useful for training. Problems it fails on entirely cannot produce verified solutions. The sweet spot is problems with pass rate between 1% and 50%.

The self-play loop is inspired by AlphaZero, where the system improves by playing against itself. For reasoning, "playing against yourself" means generating problems you can barely solve, then training on the solutions you do find.
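The sweet-spot filter can be sketched as a pass-rate estimator over self-generated problems; `attempt` and `verify` are hypothetical sampler/verifier callables:

```python
import random

def frontier_problems(problems, attempt, verify, n_samples=200,
                      min_rate=0.01, max_rate=0.5):
    """Estimate each problem's pass rate empirically and keep those in
    the ~1%-50% sweet spot: hard enough to teach, solvable enough to verify."""
    kept = []
    for prob in problems:
        passes = sum(bool(verify(prob, attempt(prob))) for _ in range(n_samples))
        rate = passes / n_samples
        if min_rate <= rate <= max_rate:
            kept.append((prob, rate))
    return kept

# Toy demo: each "problem" is just its own true pass rate.
random.seed(1)
problems = [0.0, 0.1, 0.3, 0.9]
attempt = lambda p: random.random() < p   # attempt succeeds w.p. p
verify = lambda p, ok: ok                 # verifier just reads the outcome
kept = frontier_problems(problems, attempt, verify)
# 0.0 is unsolvable, 0.9 is too easy; 0.1 and 0.3 sit at the frontier
```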

Iterated Self-Training: STaR and ReST

Both STaR and ReST formalize the rejection-sampling-then-SFT recipe as an explicit iterated loop. Neither uses on-policy RL. Both reduce to repeated supervised fine-tuning on filtered self-generated data.

STaR (Zelikman et al., 2022)

STaR (Self-Taught Reasoner) bootstraps a chain-of-thought model from a small seed set of labeled problems. The loop, on a dataset of problems with known final answers:

  1. Sample a chain of thought and final answer from the current model for each problem
  2. Filter to chains whose final answer matches the ground-truth label
  3. For problems where no sampled chain gives the correct answer, run a rationalization step: feed the correct answer into the prompt as a hint, sample a chain that "explains" the known answer, then strip the hint from the training example
  4. Fine-tune the base model on the union of filtered chains and rationalized chains
  5. Repeat with the fine-tuned model

The critical trick is rationalization. Without it, the training set is biased toward easy problems the model already solves, and hard problems contribute nothing. By providing the answer as a hint only during data generation, STaR recovers training signal from problems the model cannot yet solve unaided, while the final training example still asks the model to produce the chain without the hint.
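One STaR iteration, reduced to its control flow. `sample_chain(problem, hint=None)` and `matches(chain, answer)` are hypothetical stand-ins for the model and the answer check:

```python
def star_round(problems, answers, sample_chain, matches):
    """One STaR iteration: keep chains that reach the label unaided,
    rationalize the rest with the answer as a hint, strip the hint."""
    train = []
    for prob, ans in zip(problems, answers):
        chain = sample_chain(prob)
        if matches(chain, ans):
            train.append((prob, chain))            # solved unaided
        else:
            hinted = sample_chain(prob, hint=ans)  # rationalization step
            if matches(hinted, ans):
                train.append((prob, hinted))       # hint is stripped: the
                                                   # example is (prob, chain)
    return train

# Toy model: "solves" even problems unaided, odd ones only with a hint.
sample_chain = lambda p, hint=None: p if (p % 2 == 0 or hint is not None) else -1
matches = lambda chain, ans: chain == ans
data = star_round([1, 2, 3], [1, 2, 3], sample_chain, matches)
```

Without the `hinted` branch, problems 1 and 3 would contribute nothing, which is exactly the easy-problem bias the rationalization step exists to fix.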

ReST (Gulcehre et al., 2023)

ReST (Reinforced Self-Training) replaces the correctness filter with a learned reward model and runs a Grow / Improve loop:

  1. Grow: sample many completions from the current policy on a fixed prompt set
  2. Improve: score completions with a reward model, keep the top-$k$ or those above a threshold, fine-tune the policy on the kept set
  3. Optionally run multiple Improve passes (with increasing thresholds) per Grow pass
  4. Repeat

ReST is strictly iterated supervised fine-tuning on reward-filtered data. There is no policy gradient, no importance sampling, and no on-policy KL constraint during the SFT step. Compared to RLHF, the infrastructure is simpler and the training is more stable; the cost is that off-policy data from earlier Grow steps becomes stale as the policy drifts.
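The Grow/Improve loop above, as a sketch; `sample` and `reward` are hypothetical stand-ins for the current policy and the reward model:

```python
import random

def rest_round(prompts, sample, reward, grow_n=8, keep_frac=0.25):
    """One ReST Grow/Improve pass: sample a pool per prompt, score it,
    keep the top fraction. Fine-tune on the kept set, then repeat."""
    kept = []
    for prompt in prompts:
        pool = [sample(prompt) for _ in range(grow_n)]            # Grow
        pool.sort(key=lambda y: reward(prompt, y), reverse=True)  # Improve
        kept.extend((prompt, y) for y in pool[: max(1, int(grow_n * keep_frac))])
    return kept

# Toy demo: completions are random scores, reward is the score itself,
# so the round keeps the 2 best of 8 samples.
random.seed(0)
kept = rest_round(["p"], lambda prompt: random.random(), lambda prompt, y: y)
```

Note there is no gradient of a policy-gradient objective anywhere in the loop; the "reinforcement" is entirely in which samples survive the filter.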

ReST$^{EM}$ (Singh et al., 2023) reinterprets this loop as expectation-maximization on a latent-variable model where the latent is the reasoning chain. The E-step samples chains and filters by binary correctness (treating the verifier as an indicator function on the marginal likelihood). The M-step fits the policy to the filtered chains. Applied to MATH and APPS, ReST$^{EM}$ matches or exceeds human-data fine-tuning while using only model-generated chains.

Curation Regimes at a Glance

| Regime | Feedback source | Online? | What becomes supervision | Main risk |
| --- | --- | --- | --- | --- |
| Rejection sampling | final-answer verifier | no | accepted complete solutions | only improves behaviors the current policy can already occasionally produce |
| STaR | known answer plus rationalized trace | no, iterative | filtered and rationalized chains | answer-conditioned traces can look cleaner than the model's unaided reasoning |
| ReST / ReST$^{EM}$ | reward model or correctness filter | no, iterative | reward-filtered self-generated traces | stale off-policy data and reward-model bias |
| PRM / process supervision | step-level labels or rollout-derived scores | usually no | step annotations or step scorers | process labels are expensive or proxy-like |
| RLVR | verifier reward during policy updates | yes | updated policy rather than a fixed dataset | sparse rewards, weak test coverage, verifier hacking |

Process Supervision and Synthetic Reasoning Data

PRM800K (Lightman et al., 2023)

"Let's Verify Step by Step" released PRM800K, a dataset of roughly 800,000 step-level correctness labels over solutions to MATH problems. Each step of each solution is annotated as correct, neutral, or incorrect. The dataset is used to train process reward models that score partial reasoning chains. On the MATH benchmark, process reward models trained on PRM800K outperform outcome-only reward models at selecting correct solutions from a pool of samples, particularly as the pool size grows. The paper established step-level supervision as a measurable improvement over outcome-only supervision on hard reasoning.

Math-Shepherd (Wang et al., 2023)

Math-Shepherd removes the human annotator from the PRM loop. Each step of a partial solution is scored by running tree-search rollouts from that step and estimating the probability that a completion from this prefix reaches a correct final answer. Steps with high estimated completion probability are labeled correct; low-probability steps are labeled incorrect. This yields automatic step-level labels at the cost of a large rollout budget, with no human annotation required.
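The rollout-based labeling rule can be sketched directly. `rollout(prefix)` and `is_correct(answer)` are hypothetical stand-ins for the completion sampler and the final-answer check:

```python
def label_steps(steps, rollout, is_correct, n_rollouts=16, threshold=0.5):
    """Math-Shepherd-style automatic labels: mark a step 'good' when
    completions sampled from its prefix usually reach a correct answer."""
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        wins = sum(bool(is_correct(rollout(prefix))) for _ in range(n_rollouts))
        labels.append(wins / n_rollouts >= threshold)
    return labels

# Toy chain: step values say whether the chain is still on track; a rollout
# from a prefix can only succeed if every step so far is sound.
chain = [True, True, False, True]
labels = label_steps(chain, rollout=lambda pre: all(pre), is_correct=bool)
# first wrong step poisons every later prefix: [True, True, False, False]
```

The rollout budget is the cost: every step of every solution needs `n_rollouts` completions, which is why this trades annotation expense for compute.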

AlphaProof (Google DeepMind, 2024)

AlphaProof combines a language model with Lean 4 as an external proof checker. Candidate proof steps are generated by the model, type-checked by Lean, and successful proofs are fed back as training data in an AlphaZero-style loop. AlphaProof, together with AlphaGeometry 2, reached silver-medal performance on the 2024 International Mathematical Olympiad, solving four of the six problems. The system illustrates verification-as-training-signal in its strongest form: formal proof assistants are zero-false-positive verifiers, so every accepted proof is guaranteed correct by construction.

Synthetic reasoning data at scale (Phi-4, Phi-4-reasoning, Orca)

Phi-4, Phi-4-reasoning, and the Orca series rely heavily on synthetic reasoning data generated by stronger teacher models, then filtered and rewritten before being used to train smaller students. The bet is that curated synthetic reasoning can substitute for scarce high-quality human reasoning data, provided the synthesis and filtering pipelines are themselves high quality.

Connection to RLVR

Proposition

Verification as RL Reward Signal

Statement

RL with verifiable rewards (RLVR) uses the verifier output as a sparse reward signal:

$$r(x, y) = \begin{cases} 1 & \text{if verifier accepts solution } y \text{ to problem } x \\ 0 & \text{otherwise} \end{cases}$$

The RL objective is:

$$\max_\theta\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} [r(x, y)] \;-\; \beta\, \mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$$

where $\pi_{\text{ref}}$ is a reference policy (typically the SFT model) and $\beta$ controls the KL penalty. This is closely related to rejection sampling followed by a KL-regularized policy update.

Intuition

Instead of training on pre-collected verified solutions (rejection sampling then SFT), RLVR trains the model online: generate, verify, update policy. The RL framework handles credit assignment automatically through the reward signal. The KL penalty prevents the model from collapsing to a narrow set of solution strategies.
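The reward and its offline analogue can be sketched together (toy verifier; in a real system the verifier is a test runner or proof checker):

```python
def rlvr_reward(problem, solution, verifier):
    """Sparse binary reward: 1.0 if the external verifier accepts, else 0.0."""
    return 1.0 if verifier(problem, solution) else 0.0

# Offline analogue: rejection sampling keeps exactly the reward-1 samples,
# which is the connection between the two pipelines drawn above.
verifier = lambda prob, sol: sol == 2 * prob   # toy ground-truth check
samples = [(3, 5), (3, 6), (3, 7), (3, 6)]
kept = [(p, s) for p, s in samples if rlvr_reward(p, s, verifier) == 1.0]
```

In RLVR proper, the same `rlvr_reward` value feeds a policy-gradient update instead of a filter, which is what allows exploration beyond the current policy's known strategies.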

Why It Matters

RLVR is the approach documented for DeepSeek-R1 (2025). Other reasoning systems use related but not always identical recipes; public information on o1 and Claude extended-thinking models does not fully specify the pipeline. The advantage over rejection sampling plus SFT is that the model can explore and discover new solution strategies during training, rather than being limited to strategies it already knew. The verifier provides a reliable training signal without requiring human preference labels.

Failure Mode

Sparse binary rewards make RL optimization difficult. If the model's pass rate is very low, the reward signal is almost always zero and learning stalls. Reward hacking is possible if the verifier has exploitable weaknesses (e.g., a test suite with insufficient coverage). The KL penalty is crucial: without it, the model collapses to generating only the simplest correct solutions.

Data Quality vs Data Quantity

For reasoning, quality dominates quantity. Key empirical findings:

  1. Correct > plentiful: training on 10,000 verified-correct math solutions outperforms training on 100,000 unverified solutions that include errors
  2. Hard > easy: training on problems the model finds difficult (pass rate 1-20%) produces more improvement than training on easy problems (pass rate > 80%)
  3. Diverse > repetitive: solutions that use different reasoning strategies for the same problem type produce more robust reasoning than many solutions using the same approach
  4. Process > outcome: when available, step-by-step verified chains produce better reasoners than outcome-only verification

Audit Questions For A Reasoning Dataset

Before trusting a reasoning-data pipeline, ask:

  • What exactly is verified: the final answer, each intermediate step, or only a learned score?
  • Can the generator see the test suite, answer format, or proof checker interface well enough to game it?
  • Are the kept solutions diverse, or are they near-duplicates of one template?
  • Is the frontier difficulty calibrated, or are you mostly collecting already-solved easy problems?
  • Does the kept trace justify the answer, or does it only coincide with a correct final output?

Common Confusions

Watch Out

Rejection sampling is not cherry-picking results

Rejection sampling generates training data, not evaluation results. You generate many solutions, keep the correct ones, and train on them. This is the training pipeline, not the evaluation protocol. At evaluation time, the model generates a single solution (or uses majority voting), and that solution is either correct or not.

Watch Out

A perfect verifier does not guarantee perfect reasoning

Even with a zero-false-positive verifier, the training data only contains solutions that happen to arrive at the correct final answer. The reasoning chain may contain errors that cancel out (e.g., two sign errors that compensate). This is why process reward models, which verify each step, produce stronger reasoners than outcome-only verification.

Watch Out

ReST is not policy-gradient RL

ReST samples outputs, filters them with a reward model or threshold, and then fine-tunes on the kept set. That is iterated supervised learning on filtered data, not PPO or REINFORCE. The infrastructure is simpler, but the data becomes stale as the policy drifts.

Watch Out

More compute for rejection sampling has diminishing returns

Generating $N = 1000$ samples when $p = 0.1$ gives $1 - (0.9)^{1000} \approx 1$ probability of finding a correct solution. But generating 10,000 samples barely helps further. The marginal value of additional samples decreases rapidly once $Np$ is large. The bottleneck shifts to problem diversity, not sample count.

Summary

  • Reasoning data requires verified correctness, not just fluency
  • Rejection sampling: generate many solutions, keep verified correct ones
  • External verifiers (code execution, proof assistants, symbolic checkers) are the gold standard
  • Self-play generates problems at the model's capability frontier
  • RLVR uses verifier output as RL reward, enabling online exploration
  • Data quality matters more than quantity: correct, hard, diverse solutions
  • Process verification (step-by-step) is stronger than outcome verification (final answer only)

Exercises

ExerciseCore

Problem

A model has a 5% pass rate on competition math problems. How many samples per problem do you need to generate to have at least a 95% chance of finding one correct solution?

ExerciseAdvanced

Problem

You are designing a self-play loop for math reasoning. The model currently solves 30% of problems at difficulty level $d$ and 2% at level $d+1$. Should you primarily train on level $d$ problems (high pass rate) or level $d+1$ problems (low pass rate)? Justify quantitatively by considering the cost per verified correct solution.


Last reviewed: April 23, 2026
