Related papers: Towards Large Language Models as Copilots for Theorem Proving in Lean

Towards Large Language Models as Copilots for Theorem Proving in Lean

URL: http://arxiv.org/abs/2404.12534v1
Date: Thu, 18 Apr 2024 22:54:08 GMT
Title: Towards Large Language Models as Copilots for Theorem Proving in Lean
Authors: Peiyang Song, Kaiyu Yang, Anima Anandkumar,
Abstract summary: We introduce Lean Copilot, a framework for running Lean inference in large language models. We build tools for suggesting proof steps, completing intermediate proof goals, and selecting relevant premises. Experimental results demonstrate the effectiveness of our method in assisting humans and theorem proving process.
Score: 81.94024084598598
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Theorem proving is an important challenge for large language models (LLMs), as formal proofs can be checked rigorously by proof assistants such as Lean, leaving no room for hallucination. Existing LLM-based provers try to prove theorems in a fully autonomous mode without human intervention. In this mode, they struggle with novel and challenging theorems, for which human insights may be critical. In this paper, we explore LLMs as copilots that assist humans in proving theorems. We introduce Lean Copilot, a framework for running LLM inference in Lean. It enables programmers to build various LLM-based proof automation tools that integrate seamlessly into the workflow of Lean users. Using Lean Copilot, we build tools for suggesting proof steps (tactic suggestion), completing intermediate proof goals (proof search), and selecting relevant premises (premise selection) using LLMs. Users can use our pretrained models or bring their own ones that run either locally (with or without GPUs) or on the cloud. Experimental results demonstrate the effectiveness of our method in assisting humans and automating theorem proving process compared to existing rule-based proof automation in Lean. We open source all codes under a permissive MIT license to facilitate further research.

Related papers

HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement [7.702809989052384]
HybridProver is a dual-model proof framework that combines tactic-based generation and whole-proof synthesis.<n>We implement HybridProver for the Isabelle theorem prover and fine-tune LLMs on our optimized datasets.
arXiv Detail & Related papers (2025-05-21T16:45:43Z)
APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning [8.056359341994941]
APOLLO is a modelagnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities.<n>A set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean.<n>The repaired subproofs are recombined and reverified, iterating up to a usercontrolled maximum number of attempts.
arXiv Detail & Related papers (2025-05-09T03:38:31Z)
Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification [0.4779196219827508]
We introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. To train the LLM, we employ a 2-stage finetuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the S3 bucket access policy code.
arXiv Detail & Related papers (2025-04-23T18:04:38Z)
LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z)
Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs [6.65877320351217]
We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs. We generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude. Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks.
arXiv Detail & Related papers (2025-02-16T06:20:39Z)
Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a novel theorem proving method that enjoys superior sample efficiency. Our method is validated on the miniF2F-test benchmark using the open-source deepseek-math-7b-base model and the Isabelle proof assistant.
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems [47.93470713879515]
InternLM2.5-Steper achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. It achieves a pass of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus.
arXiv Detail & Related papers (2024-10-21T07:18:23Z)
zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning. We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z)
Proof Automation with Large Language Models [6.587933406842906]
Large Language Models (LLMs) have shown promise in automatically generating informal proofs in natural language. We propose PALM, a novel generate-then-repair approach that first prompts an LLM to generate an initial proof and then leverages targeted symbolic methods to iteratively repair low-level problems. Our results show that PALM significantly outperforms other state-of-the-art approaches, successfully proving 76.6% to 180.4% more theorems.
arXiv Detail & Related papers (2024-09-22T00:19:27Z)
Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts [26.98890165420689]
TheoremLlama is an end-to-end framework that trains a general-purpose Lean4 expert. Our framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively.
arXiv Detail & Related papers (2024-07-03T15:36:18Z)
Efficient Tool Use with Chain-of-Abstraction Reasoning [65.18096363216574]
Large language models (LLMs) need to ground their reasoning to real-world knowledge. There remains challenges for fine-tuning LLM agents to invoke tools in multi-step reasoning problems. We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)
Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method [1.8130068086063336]
We introduce DS-Prover, a novel dynamic sampling method for theorem proving. We augment the training dataset by decomposing simplification and rewrite tactics with multiple premises into tactics with single premises. We achieve a state-of-the-art performance (Pass@1) of 14.2% on the ProofNet dataset and a performance of 29.8% on MiniF2F.
arXiv Detail & Related papers (2023-12-20T09:55:21Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements. This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models. We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking. This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.