Related papers: InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems

URL: http://arxiv.org/abs/2410.15700v1
Date: Mon, 21 Oct 2024 07:18:23 GMT
Title: InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale LEAN Problems
Authors: Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, Kai Chen,
Abstract summary: InternLM2.5-Steper achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. It achieves a pass of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus.
Score: 47.93470713879515
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have emerged as powerful tools in mathematical theorem proving, particularly when utilizing formal languages such as LEAN. The major learning paradigm is expert iteration, which necessitates a pre-defined dataset comprising numerous mathematical problems. In this process, LLMs attempt to prove problems within the dataset and iteratively refine their capabilities through self-training on the proofs they discover. We propose to use large scale LEAN problem datasets Lean-workbook for expert iteration with more than 20,000 CPU days. During expert iteration, we found log-linear trends between solved problem amount with proof length and CPU usage. We train a critic model to select relatively easy problems for policy models to make trials and guide the model to search for deeper proofs. InternLM2.5-StepProver achieves open-source state-of-the-art on MiniF2F, Lean-Workbook-Plus, ProofNet, and Putnam benchmarks. Specifically, it achieves a pass of 65.9% on the MiniF2F-test and proves (or disproves) 17.0% of problems in Lean-Workbook-Plus which shows a significant improvement compared to only 9.5% of problems proved when Lean-Workbook-Plus was released. We open-source our models and searched proofs at https://github.com/InternLM/InternLM-Math and https://huggingface.co/datasets/internlm/Lean-Workbook.

Related papers

Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs [6.65877320351217]
We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs. We generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude. Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks.
arXiv Detail & Related papers (2025-02-16T06:20:39Z)
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5 2025) performance in automated formal proof generation for mathematical problems. First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4. Next, we develop a large dataset of formal proofs by training a series of provers. Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1
arXiv Detail & Related papers (2025-02-11T15:27:35Z)
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving [33.61458249318183]
Self-play Theorem Prover (STP) takes on two roles, conjecturer and prover. STP simultaneously takes on two roles, conjecturer and prover. We evaluate with both Lean and Isabelle formal versifiers.
arXiv Detail & Related papers (2025-01-31T23:01:48Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
A Lean Dataset for International Math Olympiad: Small Steps towards Writing Math Proofs for Hard Problems [0.0]
We write complete, original formal proofs for the remaining IMO problems in Lean. This effort expands the availability of proof currently in the public domain by creating 5,880 lines of Lean proof.
arXiv Detail & Related papers (2024-11-28T02:50:42Z)
One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books. Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z)
Lean Workbook: A large-scale Lean problem set formalized from natural language math problems [50.22847430754973]
Large language models are not good at math theorem proving using formal languages like Lean. A significant challenge in this area is the scarcity of training data available in these formal languages. We propose a novel pipeline that iteratively generates and filters synthetic data to translate natural language mathematical problems into Lean 4 statements.
arXiv Detail & Related papers (2024-06-06T08:25:43Z)
Towards Large Language Models as Copilots for Theorem Proving in Lean [81.94024084598598]
We introduce Lean Copilot, a framework for running Lean inference in large language models. We build tools for suggesting proof steps, completing intermediate proof goals, and selecting relevant premises. Experimental results demonstrate the effectiveness of our method in assisting humans and theorem proving process.
arXiv Detail & Related papers (2024-04-18T22:54:08Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
Enhancing Neural Theorem Proving through Data Augmentation and Dynamic Sampling Method [1.8130068086063336]
We introduce DS-Prover, a novel dynamic sampling method for theorem proving. We augment the training dataset by decomposing simplification and rewrite tactics with multiple premises into tactics with single premises. We achieve a state-of-the-art performance (Pass@1) of 14.2% on the ProofNet dataset and a performance of 29.8% on MiniF2F.
arXiv Detail & Related papers (2023-12-20T09:55:21Z)
LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean. Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements. This paper introduces LeanDojo: an open-source Lean toolkit consisting of toolkits, data, models. We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.