Related papers: REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

URL: http://arxiv.org/abs/2505.20613v2
Date: Tue, 17 Jun 2025 03:25:43 GMT
Title: REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Authors: Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong,
Abstract summary: We present REAL-Prover, a new open-source stepwise theorem prover for Lean 4.<n>Our prover notably boosts performance on solving college-level mathematics problems.<n>In experiments, our prover using only supervised fine-tune theorem achieves competitive results with a 23.7% success rate.
Score: 12.343823629674368
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

Related papers

Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5 2025) performance in automated formal proof generation for mathematical problems.<n>First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4.<n>Next, we develop a large dataset of formal proofs by training a series of provers.<n>Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1
arXiv Detail & Related papers (2025-02-11T15:27:35Z)
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [82.9413277326097]
Chain-of-Reasoning (CoR) is a novel unified framework integrating multiple reasoning paradigms.<n>CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution.<n> Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z)
Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models [8.370453544530914]
Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks. Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities. We propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students.
arXiv Detail & Related papers (2024-10-10T16:02:36Z)
LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover [56.34998574538897]
We propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from Lean 4 repositories on GitHub. Our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing state-of-the-art method at 52%.
arXiv Detail & Related papers (2024-07-24T12:28:03Z)
DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems. We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs. Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z)
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [38.127313175508746]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique. MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
Augmenting Math Word Problems via Iterative Question Composing [7.493665644128088]
We introduce the MMIQC dataset, comprising a mixture of processed web data and synthetic question-response pairs.<n>Qwen-72B-MMIQC achieves a 45.0% accuracy, exceeding the previous open-source state-of-the-art by 8.2% and outperforming the initial version GPT-4 released in 2023.
arXiv Detail & Related papers (2024-01-17T06:48:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.