Proof Automation with Large Language Models
- URL: http://arxiv.org/abs/2409.14274v1
- Date: Sun, 22 Sep 2024 00:19:27 GMT
- Title: Proof Automation with Large Language Models
- Authors: Minghai Lu, Benjamin Delaware, Tianyi Zhang
- Abstract summary: Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language.
We propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems.
Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive theorem provers such as Coq are powerful tools to formally guarantee the correctness of software. However, using these tools requires significant manual effort and expertise. While Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language, they are less effective at generating formal proofs in interactive theorem provers. In this paper, we conduct a formative study to identify common mistakes made by LLMs when asked to generate formal proofs. By analyzing 520 proof generation errors made by GPT-3.5, we found that GPT-3.5 often identified the correct high-level structure of a proof, but struggled to get the lower-level details correct. Based on this insight, we propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. We evaluate PALM on a large dataset that includes more than 10K theorems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems. Moreover, PALM proves 1270 theorems beyond the reach of existing approaches. We also demonstrate the generalizability of PALM across different LLMs.
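The generate-then-repair loop described in the abstract can be sketched as follows. This is a minimal illustration, not PALM's implementation: the three helpers (query_llm, coq_check, symbolic_repair) are hypothetical stubs standing in for the LLM, the Coq checker, and the paper's symbolic repair mechanisms.

```python
# A minimal sketch of a generate-then-repair loop in the style of PALM.
# The three helpers are hypothetical stubs, not PALM's actual interfaces
# to the LLM, to Coq, or to its symbolic repair mechanisms.
from typing import Optional

def query_llm(prompt: str) -> str:
    """Hypothetical stub: ask an LLM for an initial Coq proof script."""
    raise NotImplementedError

def coq_check(theorem: str, proof: str) -> Optional[str]:
    """Hypothetical stub: run the proof in Coq; return the first error
    message, or None if the proof is accepted."""
    raise NotImplementedError

def symbolic_repair(proof: str, error: str) -> Optional[str]:
    """Hypothetical stub: apply a targeted symbolic fix for one low-level
    error, keeping the proof's high-level structure intact."""
    raise NotImplementedError

def prove(theorem: str, max_repairs: int = 10) -> Optional[str]:
    """Generate an initial proof, then iteratively repair low-level errors."""
    proof = query_llm(f"Write a Coq proof for:\n{theorem}")
    for _ in range(max_repairs):
        error = coq_check(theorem, proof)
        if error is None:
            return proof      # Coq accepts the proof; we are done
        proof = symbolic_repair(proof, error)
        if proof is None:
            return None       # no repair rule applies; give up
    return None
```

The division of labor mirrors the formative study's finding: the LLM supplies the high-level proof structure, and the symbolic repairs target the low-level details it tends to get wrong.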
Related papers
- Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically [29.908878832382523]
This paper focuses on improving LLMs' ability to write proofs in formal languages that permit automated proof verification/evaluation.
We work in a more natural setup where the lemmas that are directly relevant to the theorem are not given to the theorem prover at test time.
We design an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas.
arXiv Detail & Related papers (2024-11-04T05:57:40Z)
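To make the lemma decomposition described in the entry above concrete, here is a toy Lean 4 example (ours, not from the paper): the goal is split into two intermediate lemmas with `have`, each is proved on its own, and the lemmas are then combined.

```lean
-- Toy Lean 4 example (not from the paper): decompose the goal into two
-- intermediate lemmas with `have`, prove each, then combine them.
theorem double_nonneg (n : Nat) : 0 ≤ n + n := by
  have h1 : 0 ≤ n := Nat.zero_le n             -- lemma 1
  have h2 : n ≤ n + n := Nat.le_add_left n n   -- lemma 2
  exact Nat.le_trans h1 h2
```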
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
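As a toy illustration of the interleaved format described above (ours, not from the Lean-STaR paper), a natural-language thought precedes the tactic it motivates:

```lean
-- Toy illustration (not from the paper) of a thought preceding a tactic,
-- in the interleaved style Lean-STaR trains on.
example (n : Nat) : n + 0 = n := by
  -- Thought: `n + 0` reduces by the definition of addition on Nat,
  -- so simplification should close the goal.
  simp
```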
- TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts [26.98890165420689]
TheoremLlama is an end-to-end framework that trains a general-purpose LLM to become a Lean4 expert.
Our framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively.
arXiv Detail & Related papers (2024-07-03T15:36:18Z)
- Proving Theorems Recursively [80.42431358105482]
We propose POETRY, which proves theorems in a level-by-level manner.
Unlike previous step-by-step methods, POETRY searches for a sketch of the proof at each level.
We observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
arXiv Detail & Related papers (2024-05-23T10:35:08Z)
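A toy Lean 4 sketch in the level-by-level spirit of the entry above (ours, not from the paper): the current level commits to a proof outline and defers each subgoal with `sorry`, to be filled in at the next level.

```lean
-- Toy sketch (not from the paper): this level fixes the proof outline and
-- defers the subgoals with `sorry`; the next level proves each deferred goal.
example (a b : Nat) : a + b = b + a := by
  induction a with
  | zero => sorry        -- base case, deferred to the next level
  | succ k ih => sorry   -- inductive step, deferred to the next level
```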
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z)
- Towards Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running LLM inference in Lean.
We build tools for suggesting proof steps, completing intermediate proof goals, and selecting relevant premises.
Experimental results demonstrate the effectiveness of our method in assisting humans and automating the theorem proving process.
arXiv Detail & Related papers (2024-04-18T22:54:08Z)
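A hedged usage sketch of the entry above; the module and tactic names below are assumptions inferred from the abstract's description of the tools, not verified against Lean Copilot's actual API.

```lean
import LeanCopilot  -- assumed module name

example (a b : Nat) : a + b = b + a := by
  -- `suggest_tactics` is assumed here to be the proof-step suggestion
  -- tool described in the abstract; treat the name as illustrative.
  suggest_tactics
```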
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present NLProofS (Natural Language Proof Search), a novel stepwise method.
NLProofS learns to generate relevant steps conditioned on the hypothesis.
It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)
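The verifier-guided search described above can be sketched as a best-first loop over partial proofs. This is a minimal sketch under assumed interfaces: the step generator and verifier below are hypothetical stubs, not the paper's models.

```python
# Minimal sketch of verifier-guided stepwise proof search in the spirit of
# NLProofS. The generator and verifier are hypothetical stubs.
import heapq

def propose_steps(hypothesis: str, proof: list[str]) -> list[str]:
    """Hypothetical stub: a model proposes candidate next proof steps."""
    raise NotImplementedError

def verifier_score(proof: list[str]) -> float:
    """Hypothetical stub: a verifier scores how valid a partial proof is."""
    raise NotImplementedError

def search(hypothesis: str, budget: int = 100) -> list[str] | None:
    """Best-first search over partial proofs, ranked by verifier score."""
    frontier: list[tuple[float, list[str]]] = [(0.0, [])]
    for _ in range(budget):
        if not frontier:
            break
        _, proof = heapq.heappop(frontier)  # most promising partial proof
        for step in propose_steps(hypothesis, proof):
            candidate = proof + [step]
            if step == hypothesis:          # the hypothesis was derived
                return candidate
            # heapq is a min-heap, so negate the score to pop best first
            heapq.heappush(frontier, (-verifier_score(candidate), candidate))
    return None
```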