Baldur: Whole-Proof Generation and Repair with Large Language Models
- URL: http://arxiv.org/abs/2303.04910v1
- Date: Wed, 8 Mar 2023 22:00:15 GMT
- Title: Baldur: Whole-Proof Generation and Repair with Large Language Models
- Authors: Emily First, Markus N. Rabe, Talia Ringer, Yuriy Brun
- Abstract summary: We use large language models, trained on natural language text and code and fine-tuned on proofs, to generate whole proofs for theorems at once.
We combine this proof generation model with a fine-tuned repair model to repair generated proofs, further increasing proving power.
We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Formally verifying software properties is a highly desirable but
labor-intensive task. Recent work has developed methods to automate formal
verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by
training a model to predict one proof step at a time, and using that model to
search through the space of possible proofs. This paper introduces a new method
to automate formal verification: We use large language models, trained on
natural language text and code and fine-tuned on proofs, to generate whole
proofs for theorems at once, rather than one step at a time. We combine this
proof generation model with a fine-tuned repair model to repair generated
proofs, further increasing proving power. As its main contributions, this paper
demonstrates for the first time that: (1) Whole-proof generation using
transformers is possible and is as effective as search-based techniques without
requiring costly search. (2) Giving the learned model additional context, such
as a prior failed proof attempt and the ensuing error message, results in proof
repair and further improves automated proof generation. (3) We establish a new
state of the art for fully automated proof synthesis. We reify our method in a
prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL
theorems and their proofs. In addition to empirically showing the effectiveness
of whole-proof generation, repair, and added context, we show that Baldur
improves on the state-of-the-art tool, Thor, by automatically generating proofs
for an additional 8.7% of the theorems. Together, Baldur and Thor can prove
65.7% of the theorems fully automatically. This paper paves the way for new
research into using large language models for automating formal verification.
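To make the two-stage pipeline concrete, the following is a minimal Python sketch of the generate-then-repair loop the abstract describes. It is illustrative only: `llm` stands in for a fine-tuned proof model, `check_proof` for a call into Isabelle/HOL, and all names and prompt formats are hypothetical rather than Baldur's actual interface.

```python
# Hypothetical sketch of whole-proof generation plus repair. The two
# model calls below are stand-ins for Baldur's fine-tuned generation
# and repair models; `check_proof` stands in for Isabelle/HOL.

from typing import Callable, NamedTuple, Optional

class CheckResult(NamedTuple):
    ok: bool
    error: str = ""  # the proof assistant's error message on failure

def prove(
    llm: Callable[[str], str],
    check_proof: Callable[[str, str], CheckResult],
    theorem: str,
    attempts: int = 4,
) -> Optional[str]:
    """Try whole-proof generation; on failure, attempt one repair per sample."""
    for _ in range(attempts):
        # Generate a whole proof in one shot -- no per-step proof search.
        proof = llm(f"Theorem:\n{theorem}\nProof:")
        result = check_proof(theorem, proof)
        if result.ok:
            return proof
        # Repair: re-prompt with the failed attempt and the checker's
        # error message as additional context.
        repaired = llm(
            f"Theorem:\n{theorem}\n"
            f"Failed proof attempt:\n{proof}\n"
            f"Error message:\n{result.error}\n"
            "Corrected proof:"
        )
        if check_proof(theorem, repaired).ok:
            return repaired
    return None
```

Note that the abstract describes a separately fine-tuned repair model; the single `llm` callable above conflates generation and repair for brevity.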
Related papers
- Cobblestone: Iterative Automation for Formal Verification (2024-10-25)
  Formal verification using proof assistants, such as Coq, is an effective way of improving software quality, but it is expensive.
  Recent research has used machine learning to automatically synthesize proofs, reducing verification effort, but these tools are able to prove only a fraction of the desired software properties.
  We introduce Cobblestone, a new proof-synthesis approach that improves on the state of the art by taking advantage of partial progress in proof-synthesis attempts.
- ImProver: Agent-Based Automated Proof Optimization (2024-10-07)
  We present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean.
  We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems.
- Lean-STaR: Learning to Interleave Thinking and Proving (2024-07-14)
  We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
  Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem-proving environment.
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data (2024-02-14)
  MUSTARD is a framework for uniformly synthesizing high-quality, diverse theorem and proof data.
  It yields MUSTARDSAUCE, a theorem-and-proof benchmark with 5,866 valid data points.
  We perform extensive analysis and demonstrate that MUSTARD generates validated, high-quality step-by-step data.
- Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs (2022-10-21)
  We introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches (a code illustration of this pipeline follows the list).
  We show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs.
- NaturalProver: Grounded Mathematical Proof Generation with Language Models (2022-05-25)
  Theorem proving in natural mathematical language plays a central role in mathematical advances and education.
  We develop NaturalProver, a language model that generates proofs by conditioning on background references.
  NaturalProver is capable of proving some theorems that require short (2-6 step) proofs, and of providing next-step suggestions that are rated as correct and useful over 40% of the time.
- Generating Natural Language Proofs with Verifier-Guided Search (2022-05-25)
  We present NLProofS (Natural Language Proof Search), a novel stepwise method.
  NLProofS learns to generate relevant steps conditioned on the hypothesis.
  It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
- PRover: Proof Generation for Interpretable Reasoning over Rules (2020-10-06)
  We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
  Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
  We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases, showing promising results for QA and proof generation.
- Generative Language Modeling for Automated Theorem Proving (2020-09-07)
  This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans might be addressable via generation from language models.
  We present GPT-f, an automated prover and proof assistant for the Metamath formalization language, and analyze its performance.
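The Draft, Sketch, and Prove entry above describes a three-stage pipeline of its own. The sketch below illustrates it under stated assumptions: the callables `llm` and `close_gap`, the prompt formats, and the use of `sorry` to mark open gaps are hypothetical stand-ins, not the paper's actual interface (the DSP paper targets Isabelle and closes gaps with off-the-shelf automated provers).

```python
# Hypothetical sketch of the Draft, Sketch, and Prove (DSP) pipeline:
# draft an informal proof, map it to a formal sketch with open gaps,
# then ask an automated prover to close each gap.

from typing import Callable, Optional

def draft_sketch_prove(
    llm: Callable[[str], str],
    close_gap: Callable[[str], Optional[str]],  # automated-prover stand-in
    theorem: str,
) -> Optional[str]:
    # Draft: produce an informal, natural-language proof of the theorem.
    informal = llm(f"Give an informal proof of:\n{theorem}\nProof:")
    # Sketch: translate the informal proof into a formal proof sketch,
    # leaving unproven intermediate steps as open gaps ("sorry").
    sketch = llm(
        f"Informal proof:\n{informal}\n"
        f"Translate into a formal proof sketch of:\n{theorem}\n"
        "Mark unproven steps with 'sorry'.\nSketch:"
    )
    # Prove: hand each open gap to the automated prover; fail if any
    # gap cannot be closed.
    lines = []
    for line in sketch.splitlines():
        if line.strip() == "sorry":
            gap_proof = close_gap("\n".join(lines))
            if gap_proof is None:
                return None
            lines.append(gap_proof)
        else:
            lines.append(line)
    return "\n".join(lines)
```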