MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
- URL: http://arxiv.org/abs/2512.10187v1
- Date: Thu, 11 Dec 2025 00:52:19 GMT
- Title: MiniF2F-Dafny: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
- Authors: Mantas Baksys, Stefan Zetzsche, Olivier Bouissou,
- Abstract summary: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover. We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present miniF2F-Dafny, the first translation of the mathematical reasoning benchmark miniF2F to an automated theorem prover: Dafny. Previously, the benchmark existed only in interactive theorem provers (Lean, Isabelle, HOL Light, Metamath). We find that Dafny's automation verifies 99/244 (40.6%) of the test set and 109/244 (44.7%) of the validation set with empty proofs, requiring no manual proof steps. For problems where empty proofs fail, we evaluate 12 off-the-shelf LLMs on providing proof hints. The best model we test achieves 55.7% pass@4 success rate employing iterative error correction. These preliminary results highlight an effective division of labor: LLMs provide high-level guidance while automation handles low-level details. Our benchmark can be found on GitHub at http://github.com/dafny-lang/miniF2F .
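The division of labor the abstract describes can be illustrated with a small Dafny sketch (illustrative only, not taken from the benchmark): for some lemmas, Dafny's SMT-backed automation succeeds with an empty body, while others verify only after a high-level hint, such as an explicit induction step, is supplied in the proof body.

```dafny
// Recursively defined sum 0 + 1 + ... + n.
function SumTo(n: nat): nat {
  if n == 0 then 0 else n + SumTo(n - 1)
}

// Often discharged automatically: the empty body suffices,
// since the verifier proves the postcondition on its own.
lemma SquareNonNeg(x: int)
  ensures x * x >= 0
{
}

// May need a hint: the recursive self-call below plays the role
// of an LLM-provided high-level step (invoke the induction
// hypothesis for n - 1); the verifier fills in the arithmetic.
lemma SumFormula(n: nat)
  ensures 2 * SumTo(n) == n * (n + 1)
{
  if n > 0 {
    SumFormula(n - 1);
  }
}
```

In this style, the LLM never writes low-level proof terms; it only proposes auto-active annotations (asserts, lemma calls, invariants) that steer the automation, which matches the paper's framing of LLM guidance plus automated verification.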
Related papers
- HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement [7.702809989052384]
HybridProver is a dual-model proof framework that combines tactic-based generation and whole-proof synthesis.
We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized datasets.
arXiv Detail & Related papers (2025-05-21T16:45:43Z)
- APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [16.8655558789989]
We present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a model-agnostic agentic framework for automated theorem proving.
A set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal.
Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness.
arXiv Detail & Related papers (2025-05-09T03:38:31Z)
- STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving [33.61458249318183]
Self-play Theorem Prover (STP) simultaneously takes on two roles, conjecturer and prover.
We evaluate with both Lean and Isabelle formal verifiers.
arXiv Detail & Related papers (2025-01-31T23:01:48Z)
- ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities.
Our method is validated on the miniF2F benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-Preview-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
- InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search [65.05674971652776]
A prevalent proof method involves the prover iteratively constructing the proof tactic by tactic, typically following a best-first search scheme.
We propose an intuitive yet effective method, which utilizes a critic model to capture the preference information.
A large-scale expert iteration with more than 20,000 CPU days is then applied to further fine-tune the prover and the critic.
arXiv Detail & Related papers (2024-10-21T07:18:23Z)
- Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation [61.350306618479365]
Leakage of benchmarks can prevent the accurate assessment of large language models' true performance.
We propose Inference-Time Decontamination (ITD) to address this issue.
ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU.
arXiv Detail & Related papers (2024-06-20T04:35:59Z)
- Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running LLM inference in Lean.
We build tools that suggest proof steps, complete proof goals, and select relevant premises.
When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop).
When automating the theorem proving process, Lean Copilot automates 74.2% of proof steps on average, 85% better than aesop (40.1%).
arXiv Detail & Related papers (2024-04-18T22:54:08Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method [1.8130068086063336]
We introduce DS-Prover, a novel dynamic sampling method for theorem proving.
We augment the training dataset by decomposing simplification and rewrite tactics with multiple premises into tactics with single premises.
We achieve a state-of-the-art performance (Pass@1) of 14.2% on the ProofNet dataset and a performance of 29.8% on MiniF2F.
arXiv Detail & Related papers (2023-12-20T09:55:21Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.