Baldur: Whole-Proof Generation and Repair with Large Language Models
- URL: http://arxiv.org/abs/2303.04910v1
- Date: Wed, 8 Mar 2023 22:00:15 GMT
- Title: Baldur: Whole-Proof Generation and Repair with Large Language Models
- Authors: Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun
- Abstract summary: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once.
We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power.
We evaluate our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs.
- Score: 8.100054850290507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Formally verifying software properties is a highly desirable but
labor-intensive task. Recent work has developed methods to automate formal
verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by
training a model to predict one proof step at a time, and using that model to
search through the space of possible proofs. This paper introduces a new method
to automate formal verification: We use large language models, trained on
natural language text and code and fine-tuned on proofs, to generate whole
proofs for theorems at once, rather than one step at a time. We combine this
proof generation model with a fine-tuned repair model to repair generated
proofs, further increasing proving power. As its main contributions, this paper
demonstrates for the first time that: (1) Whole-proof generation using
transformers is possible and is as effective as search-based techniques without
requiring costly search. (2) Giving the learned model additional context, such
as a prior failed proof attempt and the ensuing error message, results in proof
repair and further improves automated proof generation. (3) We establish a new
state of the art for fully automated proof synthesis. We reify our method in a
prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL
theorems and their proofs. In addition to empirically showing the effectiveness
of whole-proof generation, repair, and added context, we show that Baldur
improves on the state-of-the-art tool, Thor, by automatically generating proofs
for an additional 8.7% of the theorems. Together, Baldur and Thor can prove
65.7% of the theorems fully automatically. This paper paves the way for new
research into using large language models for automating formal verification.
Related papers
- Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source large language model (LLM) that achieves the state-of-the-art (SOTA) performance in automated formal proof generation for mathematical problems.
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency.
Our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
arXiv Detail & Related papers (2025-01-30T12:37:06Z) - ImProver: Agent-Based Automated Proof Optimization [18.315243539816464]
We present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean.
We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems.
arXiv Detail & Related papers (2024-10-07T05:14:18Z) - Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal
Proofs [30.57062828812679]
We introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches.
We show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs.
arXiv Detail & Related papers (2022-10-21T22:37:22Z) - NaturalProver: Grounded Mathematical Proof Generation with Language
Models [84.2064569475095]
Theorem proving in natural mathematical language plays a central role in mathematical advances and education.
We develop NaturalProver, a language model that generates proofs by conditioning on background references.
NaturalProver is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time.
arXiv Detail & Related papers (2022-05-25T17:01:18Z) - Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present a novel stepwise method NLProofS (Natural Language Proof Search)
NLProofS learns to generate relevant steps conditioning on the hypothesis.
It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.