Formal Verification
AlphaProof and AI-Assisted Theorem Proving
AlphaProof (DeepMind, July 2024) reached IMO silver-medal standard by autoformalizing problems into Lean 4 and running a reinforcement-learning theorem prover; AlphaGeometry 2 solved IMO 2024 Problem 4, the geometry problem, and together the two systems solved 4 of 6 problems. This page surveys the state of AI-assisted formal mathematics in 2026: autoformalization, retrieval-augmented theorem proving, and the LLM + Lean kernel pipeline.
Why This Matters
In July 2024, Google DeepMind announced AlphaProof and AlphaGeometry 2: AI systems that earned an IMO silver medal on the 2024 International Mathematical Olympiad (July 11-22, 2024 in Bath, UK). Together they solved 4 of the 6 problems, scoring 28 of 42 points, one point short of the gold-medal threshold. AlphaProof tackled algebra and number-theory problems via Lean 4; AlphaGeometry 2 solved the geometry problem.
This was the first time an AI system reached medal-level performance on the IMO under the standard rules (that is: the same problem statements; a bounded compute budget; proofs verified by mathematicians, not by the system itself). The result is a structural shift in the AI-assisted theorem-proving landscape and a concrete demonstration that the formal-language + LLM approach scales beyond toy problems.
This page surveys the 2026 landscape:
- The AlphaProof / AlphaGeometry 2 architecture.
- The autoformalization pipeline (LLM translates English → Lean 4).
- The reinforcement-learning theorem prover layer.
- Prior work the line builds on (GPT-f, Polu-Sutskever 2020; LeanDojo / ReProver, Yang et al. 2023).
- Where this technology is going through 2026.
The AI-theorem-proving system landscape
| System | Year | Architecture | Substrate | Open / closed | Headline result |
|---|---|---|---|---|---|
| AlphaProof | 2024 | LLM autoformalizer + RL prover + Lean kernel | Lean 4 / Mathlib | closed (DeepMind) | IMO 2024 silver, 4 / 6 problems |
| AlphaGeometry 2 | 2024 | LLM proposer + symbolic geometry engine | custom geometry DSL | closed (DeepMind) | IMO 2024 problem 4 (geometry) |
| GPT-f | 2020 | Transformer trained on Metamath | Metamath | closed | first credible LM theorem prover |
| HyperTree Proof Search | 2022 | tree search + LLM proposer | Lean / Metamath | closed (Meta) | substantial miniF2F gains |
| LeanDojo / ReProver | 2023 | retrieval-augmented LLM | Lean 4 / Mathlib | open | miniF2F state-of-the-art at release |
| Llemma | 2024 | math-pretrained base model (7B / 34B) | text + Lean / Coq / Isabelle | open weights | base model for downstream provers |
| Lean Copilot | 2024 | tactic-suite LLM integration | Lean 4 / Mathlib | open | interactive premise selection + proof search |
| DeepSeek-Prover-V2 | 2025 | RL self-play on Lean | Lean 4 / Mathlib | open weights (7B and 671B) | competitive with proprietary on miniF2F |
| Goedel-Prover-V2 | 2025 | self-correction loop on Lean | Lean 4 / Mathlib | open weights | state-of-the-art miniF2F-test pass rate |
The pattern is consistent: every system separates a fast, fallible proposer (the LLM or RL agent) from a slow, sound verifier (Lean kernel, Coq kernel, custom DSL, or SMT). The proposer is trained to generate proof attempts; the verifier rejects anything that does not type-check. This is the AlphaZero pattern (Silver et al. 2017) lifted to mathematics: search + neural-network policy + ground-truth verifier.
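The pattern can be caricatured in a few lines (a toy sketch with stand-in components, not DeepMind's code): a cheap, fallible proposer enumerates candidates, and only the sound verifier's approval counts. Here the "theorem" is just a nontrivial factorization claim, so the verifier is a one-line arithmetic check.

```python
def verifier(n, candidate):
    """Stand-in for the Lean kernel: sound, checks the claim outright.

    Toy "theorem": n has a nontrivial factorization a * b."""
    a, b = candidate
    return a * b == n and a > 1 and b > 1

def proposer(n):
    """Stand-in for the neural policy: cheap guesses, many of them wrong."""
    for a in range(2, n):
        yield (a, n // a)  # integer division, so many guesses are off

def search(n):
    """Only verifier-approved candidates count as solutions."""
    for candidate in proposer(n):
        if verifier(n, candidate):
            return candidate
    return None
```

`search(15)` returns `(3, 5)`; for a prime like 13 every guess is rejected and `search` returns `None`. The structure, not the arithmetic, is the point: the proposer can be arbitrarily unreliable without compromising soundness.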
The AlphaProof Architecture
DeepMind's published description (the 2024 blog post credited to Hubert et al.; no peer-reviewed paper has appeared as of this writing) gives the following architecture:
- Autoformalization layer (LLM). A specialized language model translates the problem statement from English mathematical prose into Lean 4 syntax. The output is a `theorem ... : ... := by sorry` skeleton with the formal statement filled in.
- Search layer (RL theorem prover). A reinforcement-learning agent generates Lean 4 proof attempts by predicting tactic sequences. The agent is trained on a curriculum of problem-proof pairs from mathlib4 and synthetic problem generation. AlphaZero-style self-play across millions of problem variants drives the policy.
- Verifier layer (Lean kernel). Every candidate proof is checked by Lean 4's kernel. If the proof type-checks, the problem is considered solved. There is no reliance on the model's confidence; kernel verification is the trust layer.
Kernel-Verified Soundness Reduction
Statement
For an AlphaProof-style pipeline that emits a Lean term `t` as a candidate proof of a theorem statement `T`, the soundness of the end-to-end system reduces to the soundness of the Lean 4 kernel. Concretely: if `kernel.check t T` returns true, then `T` is a theorem of Lean's logic (under the imported environment); the AI components (autoformalizer, RL proposer, retrieval) are not in the trust base.
Intuition
The AI is a guesser; the kernel is the judge. A guess that fails to type-check is rejected and contributes nothing. A guess that type-checks is, by definition, a Lean proof of `T`. The AI's training process, biases, and even active deception cannot produce a wrong "proof" that the kernel accepts, because acceptance by the kernel is the definition of correctness. This is the same trust-base argument that makes Coq's small-kernel design valuable: by minimizing the trusted code, you minimize what can go wrong.
Why It Matters
This is the architectural reason AlphaProof is qualitatively different from a chain-of-thought language model that emits "proofs" as English. Chain-of-thought proofs require human review for correctness; kernel-checked Lean proofs require human review only of the theorem statement. The autoformalization step is therefore the remaining trust hole: if the autoformalizer mistranslates the statement (for example, by turning a universal quantifier into an existential), the kernel will happily verify the wrong theorem.
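A toy Lean 4 illustration of this trust hole, using core Lean only (theorem names and statements here are invented for illustration):

```lean
-- The intended statement: a universal claim.
theorem intended : ∀ n : Nat, n ≤ n + 1 :=
  fun n => Nat.le_succ n

-- A plausible mistranslation: the universal quantifier silently
-- became an existential. The kernel verifies this proof too: it
-- checks the proof against the statement it was given, not
-- against the statement the human meant.
theorem mistranslated : ∃ n : Nat, n ≤ n + 1 :=
  ⟨0, Nat.le_succ 0⟩
```

Both theorems type-check, so nothing in the kernel distinguishes the intended statement from the weaker one; only human review of the statement catches the divergence.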
Failure Mode
Soundness fails if (a) the kernel itself has a bug, (b) an
imported axiom is inconsistent (e.g., a buggy axiom
declaration in user code), or (c) the autoformalized statement
diverges from the human-intended statement. (a) and (b) are
also failure modes for human Lean users; (c) is specific to
the AI pipeline and is the active research frontier.
The system reportedly used "a few days" of compute per problem during the IMO event, orders of magnitude more than a human contestant's 4.5-hour exam session, but still a finite budget, which puts the technology at the threshold of practical applicability.
The architecture is AlphaZero (Silver et al. 2017) applied to theorem proving: the kernel plays the role of the game-rule verifier, and the RL agent plays the role of MCTS-guided move selection.
The AlphaGeometry 2 Track
AlphaGeometry 2 is the geometry-specific system; it solved IMO 2024 Problem 4 (a synthetic-geometry problem about a triangle and its circumcircle). Its predecessor, AlphaGeometry (Trinh et al., Nature 2024), was IMO-bronze level on a held-out olympiad-geometry test set.
The architecture differs from AlphaProof:
- Symbolic engine. A custom geometry solver applies classical geometric construction rules (extending lines, drawing circles, etc.).
- LLM proposer. A language model proposes auxiliary constructions that guide the symbolic engine toward a solution.
- Synthetic training. Trained from 100M synthetic geometry problems generated programmatically; no human proofs needed.
This is a neuro-symbolic hybrid rather than a pure RL+kernel system. The symbolic engine plays the same role as Lean's kernel: a ground-truth verifier the LLM cannot fool. AlphaGeometry 2's improvements over the original include better synthetic data and a stronger LLM proposer.
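The division of labor can be sketched abstractly (all rules and facts below are invented toy stand-ins, not AlphaGeometry's actual rule set): a symbolic engine saturates the known facts under deduction rules, and an LLM-like proposer injects auxiliary facts (constructions) when the engine gets stuck.

```python
# Toy rules: (premises, conclusion). If all premises are known,
# the conclusion may be added. Invented stand-ins for geometry rules.
RULES = [
    ({"AB=AC"}, "angleB=angleC"),                # isosceles base angles
    ({"angleB=angleC", "BM=MC"}, "AM_perp_BC"),  # toy deduction
]

def saturate(facts):
    """Symbolic engine: apply rules to a fixed point (deductive closure)."""
    facts, changed = set(facts), True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

def prove(goal, givens, proposals):
    """Alternate saturation with proposer-suggested auxiliary facts."""
    facts = saturate(givens)
    for aux in proposals:  # LLM stand-in: auxiliary constructions
        if goal in facts:
            break
        facts = saturate(facts | {aux})
    return goal in facts
```

`prove("AM_perp_BC", {"AB=AC"}, ["BM=MC"])` succeeds because the proposed midpoint fact unlocks the second rule; with no proposals, the engine stalls. The proposer only suggests; the engine's closure is the ground truth.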
Background: GPT-f and the LLM Theorem-Proving Line
The AlphaProof line builds on a decade of work on neural theorem proving:
- GPT-f (Polu and Sutskever, 2020): the first GPT-style Transformer trained on Metamath proofs. Showed that language models could generate proof scripts above chance, but performance was modest.
- HOList (Bansal et al. 2019): an early environment and benchmark for supervised learning on HOL Light proofs.
- PACT (Han et al. 2022): proof artifact co-training for Lean; a tactic-prediction model co-trained on auxiliary tasks extracted from Lean proof terms.
- HyperTree Proof Search (Lample et al. 2022): tree search on tactic sequences combined with LLM proposers.
- LeanDojo + ReProver (Yang et al., NeurIPS 2023): retrieval-augmented theorem prover; selects relevant Mathlib lemmas as context for the LLM. Open-source.
- Llemma (Azerbayev et al. 2024): a 7B / 34B parameter language model trained on math-heavy text + code, designed as a base for theorem-proving fine-tuning.
By DeepMind's account, the AlphaProof advance was substantially improved data efficiency through autoformalization-based synthetic data and stronger RL on the proof-search layer. The exact technical details await a published paper.
Autoformalization
The bottleneck for AlphaProof and similar systems is getting the problem statement into Lean 4 in the first place. This is autoformalization, translating mathematical English into a formal statement in a proof assistant.
Autoformalization
Autoformalization is the task of translating a
natural-language mathematical statement (definition, lemma,
theorem) into a syntactically valid term in a formal proof
assistant such as Lean 4, Coq, or Isabelle/HOL. Concretely:
given an English sentence such as "every continuous function
on a compact metric space attains its maximum," produce the
corresponding Lean 4 term, typically of the shape theorem extreme_value_theorem ... : ... := by ..., with the type
signature filled in (proof body left as sorry for a
downstream prover to complete). Modern pipelines also include
an autoinformalization loop: translate Lean back to English
and check semantic agreement. Autoformalization rates on
undergraduate exercises were around 30-40% in 2022 (Wu et
al.) and have climbed past 60% on definitions / 50% on
theorem statements by 2025 with retrieval-augmented frontier
models. Proofs remain unreliable.
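A hypothetical autoformalizer output for the example above might look like the following Lean 4 skeleton (assumes Mathlib is imported for `MetricSpace`, `CompactSpace`, `Continuous`, and `ℝ`; the theorem name and exact statement shape are illustrative):

```lean
-- Type signature filled in by the autoformalizer; proof body
-- left as `sorry` for a downstream prover to complete.
theorem extreme_value_theorem {X : Type*} [MetricSpace X]
    [CompactSpace X] [Nonempty X] (f : X → ℝ)
    (hf : Continuous f) : ∃ x, ∀ y, f y ≤ f x := by
  sorry
```

Getting this far (a statement that elaborates) is the part that works today; completing the `sorry` is the job of the proof-search layer.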
The state of autoformalization in 2026:
- Granite-3-Mathlib (IBM, 2024): a fine-tuned Llama-2 70B model trained on Mathlib4 + retrieval. Reports 60-70% accuracy on autoformalizing real-analysis exercises.
- ByT5-Mathlib (Wu et al. 2022, updated through 2024): byte-level transformer for autoformalization with improved accuracy on definitions vs. theorems.
- Wu et al. 2022 ("Autoformalization with Large Language Models"): the foundational paper; showed Codex and PaLM could autoformalize undergraduate math at moderate rates.
- draft + verify pipelines: LLM produces a draft Lean statement; kernel + targeted retests verify it; human or another LLM corrects errors.
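A draft + verify loop can be sketched as follows (`draft_statement` and `lean_typecheck` are hypothetical stand-ins; a real pipeline would call an LLM API and shell out to the Lean toolchain, e.g. via `lake env lean`):

```python
def draft_statement(english, feedback=None):
    """Hypothetical LLM call: returns a Lean statement draft.

    Toy behavior: first draft is malformed; with feedback, repaired."""
    if feedback is None:
        return "theorem t : := by sorry"          # malformed draft
    return f"theorem t : {english} := by sorry"   # repaired draft

def lean_typecheck(stmt):
    """Hypothetical kernel check; real pipelines run the Lean elaborator.

    Toy behavior: reject drafts with an empty type ascription."""
    return ": :=" not in stmt

def autoformalize(english, max_retries=3):
    """Draft, verify, and retry with error feedback until it checks."""
    feedback = None
    for _ in range(max_retries):
        stmt = draft_statement(english, feedback)
        if lean_typecheck(stmt):
            return stmt
        feedback = "type error"
    return None
```

The retry-with-feedback loop is the load-bearing design choice: the type checker converts a fuzzy generation problem into a search problem with a crisp accept/reject signal.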
For now, autoformalization is reliable for definitions and basic theorem statements but unreliable for proofs. The AlphaProof pipeline succeeds because the proof-search component is the heavy lifter; the autoformalization just has to nail the statement.
What This Means for Mathematicians
Several practical consequences in 2026:
- Mathlib4 contributors increasingly use AI-assisted workflows. Tools like Llemma-tuned `aesop` extensions, retrieval-based `exact?` augmentation, and Lean Copilot (Song et al. 2024) are now standard in serious Lean formalization projects.
- Olympiad-level problems are partially in reach. AlphaProof's IMO performance suggests that within 2-5 years, AI systems may consistently solve IMO problems under standard rules, possibly faster than human contestants (compute-wise; IMO time limits are the binding constraint).
- PFR formalization (Tao 2023) showed that a major research-level result (the Polynomial Freiman-Ruzsa theorem) can be formalized in Lean 4 in weeks rather than years. AI assistance shortens this further.
- AlphaGeometry-style systems for non-geometric domains. The neuro-symbolic pattern (LLM proposer + symbolic verifier) is being applied to combinatorics, number theory, and analysis. The challenge is finding the right symbolic-engine target for each domain.
- Research mathematicians as Lean operators. The role of "mathematician fluent in Lean" is becoming distinct from "Lean engineer." Mathlib4 contributors are increasingly publishing mathematicians using Lean as a research tool.
Limitations and Open Problems
Despite the IMO result, several practical limitations remain:
- Compute cost. AlphaProof reportedly used "days" of compute per problem. Real-time interactive use requires orders of magnitude speedup.
- Statement coverage. Autoformalization is only reliable for ~60% of typical undergraduate / olympiad statements; the remaining 40% require human assistance.
- Long proofs. AlphaProof excels at short, tactical proofs (10-100 lines). Multi-page proofs requiring deep conceptual moves remain hard.
- Domain transfer. Performance on number theory and algebra exceeds performance on combinatorics or analysis. No clear unified architecture yet.
- Black-box risk. AI-generated proofs can be hundreds of lines of opaque tactic invocations. The kernel verifies them, but humans cannot easily understand or maintain them.
The Path-Network Connection
For TheoremPath and the path-network domains:
- ProofsPath: AI-assisted theorem proving is the natural successor to traditional olympiad coaching. The craft of mathematical proof writing remains valuable; the AI becomes another collaborator on the proof.
- TheoremPath: ML theorems formalized in Lean 4 (via the formalizing-ml-theorems-in-lean-and-coq page) form the substrate that AlphaProof and similar systems can prove against.
- ComputationPath: the architecture of AlphaProof illustrates the RL on a verifiable substrate pattern that defines modern AI-for-math; it sits naturally in the computability + verification toolkit.
- DSAPath: tree search and retrieval are core algorithmic primitives that AlphaProof relies on.
Common Mistakes
Treating AlphaProof as a 'theorem-proving GPT'
AlphaProof is not just an LLM. It's an AlphaZero-style RL agent with the Lean kernel as the verifier. The LLM component (autoformalization, proposal generation) is necessary but not sufficient.
Confusing IMO performance with general theorem proving
IMO problems are short, well-specified, and have clean solutions. AlphaProof's success on the IMO doesn't mean it can prove arbitrary research mathematics; those proofs are longer, more open-ended, and require domain-specific mathematical insight.
Assuming the Lean kernel can be tricked
The Lean kernel is the trust layer; AlphaProof cannot "convince" it of false claims. Any proof AlphaProof produces is a correct proof in Lean 4; the AI's job is to find the proof, not to be trusted about the proof.
Overestimating autoformalization reliability
For research-level statements, autoformalization is still unreliable. Robust pipelines combine LLM drafting with human review (or with a verifier loop that retries on type-checking failure). Pure LLM autoformalization is not production-ready.
Exercises
Problem
Why is the Lean kernel the trust boundary for AlphaProof, not the RL agent or the LLM?
Problem
Sketch how an AlphaGeometry-style approach (LLM proposer + symbolic engine) might be adapted to combinatorial olympiad problems, where the symbolic substrate is less obvious.
Cross-Network Links
- TheoremPath: theorem-proving-in-lean is the foundational prerequisite. ML-theorem case studies, a Lean tactic cookbook, and a treatment of reasoning-data curation (reasoning-data-curation) are planned companion pages.
- ProofsPath (forthcoming, content/proofspath-topics/): olympiad technique pages cover the human-style techniques AlphaProof must learn (induction, pigeonhole, Vieta jumping, invariants and monovariants, the extremal principle). The Lean 4 wrappers landing in TheoremPath via PRs #185-#188 are the bridge between the human craft and the AI prover.
- ComputationPath (parked): the architecture of AlphaProof is the canonical "RL on a verifiable substrate" pattern. Until ComputationPath ships, the relevant overview lives on TheoremPath at theorem-proving-in-lean.
References
Canonical:
- Google DeepMind. "AI achieves silver-medal standard solving International Mathematical Olympiad problems." Blog post, July 25 2024. The official AlphaProof / AlphaGeometry 2 announcement; no peer-reviewed paper published as of this writing.
- Trinh, Wu, Le, He, Luong. "Solving olympiad geometry without human demonstrations." Nature 625 (2024) 476-482. The AlphaGeometry (predecessor) paper; describes the symbolic-engine + LLM-proposer architecture and the synthetic-data pipeline that AlphaGeometry 2 inherits.
- Polu and Sutskever. "Generative Language Modeling for Automated Theorem Proving." arXiv:2009.03393 (2020). GPT-f, the foundational paper for the LLM-as-proof-script-generator line.
- Silver et al. "Mastering the game of Go without human knowledge." Nature 550 (2017), 354-359. The AlphaZero paper; the AlphaProof RL+verifier pattern is its direct descendant.
Current:
- Yang et al. "LeanDojo: Theorem Proving with Retrieval-Augmented Language Models." NeurIPS 2023. Open-source retrieval-augmented theorem prover; the public benchmark for LLM-driven Lean proving.
- Azerbayev et al. "Llemma: An Open Language Model for Mathematics." ICLR 2024, arXiv:2310.10631. 7B / 34B base model trained on Proof-Pile-2; the standard open-weight starting point for theorem-proving fine-tuning.
- Song et al. "Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean." arXiv:2404.12534 (2024). The Lean Copilot tactic suite (`select_premises`, `suggest_tactics`, `search_proof`).
- Wu et al. "Autoformalization with Large Language Models." NeurIPS 2022, arXiv:2205.12615. The first systematic study of Codex / PaLM autoformalization rates on undergraduate math.
- Mathlib Community. "Mathlib4: The Lean Mathematical Library." leanprover-community.github.io, ongoing. The substrate every modern Lean-based theorem prover targets; over 1.2M lines as of 2026.
- Tao. "Formalizing the proof of PFR in Lean 4 - a personal account." Blog post, December 2023. The Polynomial Freiman-Ruzsa formalization; case study for what AI assistance can shorten in research-level formalization.
Frontier:
- DeepSeek-Prover team. "DeepSeek-Prover-V2." arXiv:2504.14589 (2025). Open-weight RL-trained prover competitive with proprietary frontier on miniF2F; 7B and 671B-MoE variants.
- Goedel-Prover team. "Goedel-Prover-V2." arXiv:2502.07640 (2025). Self-correction-driven Lean prover; reaches state-of-the-art miniF2F-test pass rates without RL on a held-out set.
- PutnamBench. "Multilingual Competition Mathematics Benchmark for Formal Theorem-Proving." arXiv:2407.11214 (2024). The Putnam-level evaluation harness across Lean, Coq, Isabelle.
Last reviewed: April 28, 2026