Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs
- URL: http://arxiv.org/abs/2506.19923v3
- Date: Mon, 29 Sep 2025 16:01:54 GMT
- Title: Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs
- Authors: Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai,
- Abstract summary: We present Prover Agent, a novel AI agent for automated theorem proving.<n>It integrates large language models (LLMs) with a formal proof assistant, Lean.<n>It achieves an 88.1% success rate on the MiniF2F benchmark.
- Score: 11.87831709160905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems.
Related papers
- Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics [25.762535169301984]
We introduce Numina-Lean-Agent, which combines Claude Code with Numina-Lean-MCP to enable autonomous interaction with Lean.<n>Using Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all problems in Putnam 2025 (12 / 12), matching the best closed-source system.
arXiv Detail & Related papers (2026-01-20T14:51:45Z) - HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs [32.234133057592935]
Hermes is a tool-assisted agent that interleaves informal reasoning with verified proof steps in Lean systems.<n>We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales.
arXiv Detail & Related papers (2025-11-24T04:50:18Z) - Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics [1.2978846076301875]
Ax-Prover is a multi-agent system for automated theorem proving in Lean.<n>It can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts.<n>We benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory.
arXiv Detail & Related papers (2025-10-14T17:57:04Z) - Solving Formal Math Problems by Decomposition and Iterative Reflection [30.54275542622631]
textbfDelta Prover orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment.<n>bftextDelta Prover achieves a state-of-the-art 95.9% success rate on the miniF2F-test benchmark.
arXiv Detail & Related papers (2025-07-21T03:56:35Z) - Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving [48.22540519786074]
Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench.<n>We propose a novel framework that decouples high-level reasoning from low-level proof generation.<n>We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success.
arXiv Detail & Related papers (2025-07-07T22:38:49Z) - Distilling LLM Agent into Small Models with Retrieval and Code Tools [57.61747522001781]
Agent Distillation is a framework for transferring reasoning capability and task-solving behavior from large language models into small language models.<n>Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models.
arXiv Detail & Related papers (2025-05-23T08:20:15Z) - Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach [14.213719696233934]
Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning.<n>We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge.
arXiv Detail & Related papers (2025-05-20T15:13:32Z) - APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [8.056359341994941]
APOLLO is a model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities.<n>A new state-of-the-art accuracy of 75.0% is established on the miniF2F benchmark.
arXiv Detail & Related papers (2025-05-09T03:38:31Z) - BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.<n>We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities.<n>Our method is validated on the miniF2F benchmark using the open-source deep-math-7b-base model and the Isabelle proof assistant.<n>We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-seek-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - LeanAgent: Lifelong Learning for Formal Theorem Proving [85.39415834798385]
We present LeanAgent, a novel lifelong learning framework for formal theorem proving.<n>LeanAgent continuously generalizes to and improves on ever-expanding mathematical knowledge.<n>It generates formal proofs for 155 theorems across 23 diverse Lean repositories.
arXiv Detail & Related papers (2024-10-08T17:11:24Z) - Textualized Agent-Style Reasoning for Complex Tasks by Multiple Round LLM Generation [49.27250832754313]
We present AgentCOT, a llm-based autonomous agent framework.
At each step, AgentCOT selects an action and executes it to yield an intermediate result with supporting evidence.
We introduce two new strategies to enhance the performance of AgentCOT.
arXiv Detail & Related papers (2024-09-19T02:20:06Z) - Autoformalizing Euclidean Geometry [74.72212706513318]
We introduce a neuro-symbolic framework for autoformalizing Euclidean geometry.
One challenge is that informal proofs rely on diagrams, leaving gaps in texts that are hard to formalize.
We provide automatic semantic evaluation for autoformalized theorem statements.
arXiv Detail & Related papers (2024-05-27T14:35:10Z) - Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Let's Reinforce Step by Step [10.65244642965387]
We use Reinforcement Learning from Human Feedback to shape model reasoning processes.
Our results show that the fine-grained reward provided by PRM-based methods enhances accuracy on simple mathematical reasoning.
We also show the critical role reward aggregation functions play in model performance.
arXiv Detail & Related papers (2023-11-10T01:35:51Z) - TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative
Language Models [68.65075559137608]
We propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas.
We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system.
We develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability.
arXiv Detail & Related papers (2023-10-16T08:42:39Z) - Autoformalization with Large Language Models [22.86710743804944]
A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence.
We show large language models provide new prospects towards this goal.
Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6%$ to $35.2%$.
arXiv Detail & Related papers (2022-05-25T09:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.