MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
- URL: http://arxiv.org/abs/2512.10187v1
- Date: Thu, 11 Dec 2025 00:52:19 GMT
- Title: MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
- Authors: Mantas Baksys, Stefan Zetzsche, Olivier Bouissou,
- Abstract summary: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover. We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves 55.7% pass@4 success rate employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .
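The division of labor the abstract describes can be illustrated with a small Dafny sketch (illustrative only, not taken from the benchmark): for some lemmas, Dafny's SMT-backed automation succeeds with an empty body, while others verify only after a high-level hint, such as an explicit induction step, is supplied in the proof body.

```dafny
// Recursively defined sum 0 + 1 + ... + n.
function SumTo(n: nat): nat {
  if n == 0 then 0 else n + SumTo(n - 1)
}

// Often discharged automatically: the empty body suffices,
// since the verifier proves the postcondition on its own.
lemma SquareNonNeg(x: int)
  ensures x * x >= 0
{
}

// May need a hint: the recursive self-call below plays the role
// of an LLM-provided high-level step (invoke the induction
// hypothesis for n - 1); the verifier fills in the arithmetic.
lemma SumFormula(n: nat)
  ensures 2 * SumTo(n) == n * (n + 1)
{
  if n > 0 {
    SumFormula(n - 1);
  }
}
```

In this style, the LLM never writes low-level proof terms; it only proposes auto-active annotations (asserts, lemma calls, invariants) that steer the automation, which matches the paper's framing of LLM guidance plus automated verification.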
Related papers
- HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement [7.702809989052384]
HybridProver is a dual-model proof framework that combines tactic-based generation and whole-proof synthesis.
We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized datasets.
arXiv Detail & Related papers (2025-05-21T16:45:43Z)
- APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [16.8655558789989]
We present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a model-agnostic agentic framework for automated theorem proving.
A set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal.
Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness.
arXiv Detail & Related papers (2025-05-09T03:38:31Z)
- STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving [33.61458249318183]
Self-play Theorem Prover (STP) simultaneously takes on two roles, conjecturer and prover.
We evaluate with both Lean and Isabelle formal verifiers.
arXiv Detail & Related papers (2025-01-31T23:01:48Z)
- ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities.
Our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
- InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search [65.05674971652776]
A prevalent proof method involves the prover iteratively constructing the proof tactic by tactic, typically following a best-first search scheme.
We propose an intuitive yet effective method, which utilizes a critic model to capture the preference information.
A large-scale expert iteration with more than 20,000 CPU days is then applied to further fine-tune the prover and the critic.
arXiv Detail & Related papers (2024-10-21T07:18:23Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running LLM inference in Lean.
We build tools that suggest proof steps, complete proof goals, and select relevant premises.
When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop).
When automating the theorem proving process, Lean Copilot automates 74.2% of proof steps on average, 85% better than aesop (40.1%).
arXiv Detail & Related papers (2024-04-18T22:54:08Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method [1.8130068086063336]
We introduce DS-Prover, a novel dynamic sampling method for theorem proving.
We augment the training dataset by decomposing simplification and rewrite tactics with multiple premises into tactics with single premises.
We achieve a state-of-the-art performance (Pass@1) of 14.2% on the ProofNet dataset and a performance of 29.8% on MiniF2F.
arXiv Detail & Related papers (2023-12-20T09:55:21Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.