HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement
- URL: http://arxiv.org/abs/2505.15740v1
- Date: Wed, 21 May 2025 16:45:43 GMT
- Title: HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement
- Authors: Jilin Hu, Jianyu Zhang, Yongwang Zhao, Talia Ringer,
- Abstract summary: HybridProver is a dual-model proof framework that combines tactic-based generation and whole-proof synthesis.<n>We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized datasets.
- Score: 7.702809989052384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Formal methods is pivotal for verifying the reliability of critical systems through rigorous mathematical proofs. However, its adoption is hindered by labor-intensive manual proofs and the expertise required to use theorem provers. Recent advancements in large language models (LLMs) offer new opportunities for automated theorem proving. Two promising approaches are generating tactics step by step and generating a whole proof directly with an LLM. However, existing work makes no attempt to combine the two approaches. In this work, we introduce HybridProver, a dual-model proof synthesis framework that combines tactic-based generation and whole-proof synthesis to harness the benefits of both approaches. HybridProver generates whole proof candidates for evaluation directly, then extracts proof sketches from those candidates. It then uses a tactic-based generation model that integrates automated tools to complete the sketches via stepwise refinement. We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized Isabelle datasets. Evaluation on the miniF2F dataset illustrates HybridProver's effectiveness. We achieve a 59.4% success rate on miniF2F, where the previous SOTA is 56.1%. Our ablation studies show that this SOTA result is attributable to combining whole-proof and tactic-based generation. Additionally, we show how the dataset quality, training parameters, and sampling diversity affect the final result during automated theorem proving with LLMs. All of our code, datasets, and LLMs are open source.
Related papers
- LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation [11.045086599038338]
We introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states.<n>We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search.
arXiv Detail & Related papers (2025-05-17T14:47:36Z) - APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [8.056359341994941]
APOLLO is a model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities.<n>A new state-of-the-art accuracy of 75.0% is established on the miniF2F benchmark.
arXiv Detail & Related papers (2025-05-09T03:38:31Z) - Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components.<n>DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components.<n>Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z) - LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof.<n>Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z) - STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving [33.61458249318183]
Self-play Theorem Prover (STP) takes on two roles, conjecturer and prover.<n>STP simultaneously takes on two roles, conjecturer and prover.<n>We evaluate with both Lean and Isabelle formal versifiers.
arXiv Detail & Related papers (2025-01-31T23:01:48Z) - Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency.<n>Our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - Lean Copilot: Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running LLM inference in Lean.<n>We build tools that suggest proof steps, complete proof goals, and select relevant premises.<n>When assisting humans, Lean Copilot requires only 2.08 manually-entered proof steps on average (3.86 required by aesop)<n>When automating the theorem proving process, Lean Copilot 74.2% proof steps on average, 85% better than aesop (40.1%).
arXiv Detail & Related papers (2024-04-18T22:54:08Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Enhancing Neural Theorem Proving through Data Augmentation and Dynamic
Sampling Method [1.8130068086063336]
We introduce DS-Prover, a novel dynamic sampling method for theorem proving.
We augment the training dataset by decomposing simplification and rewrite tactics with multiple premises into tactics with single premises.
We achieve a state-of-the-art performance (Pass@1) of 14.2% on the ProofNet dataset and a performance of 29.8% on MiniF2F.
arXiv Detail & Related papers (2023-12-20T09:55:21Z) - LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.