The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
- URL: http://arxiv.org/abs/2506.21621v1
- Date: Mon, 23 Jun 2025 13:31:58 GMT
- Title: The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
- Authors: Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev
- Abstract summary: We present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research. We explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality.
- Score: 7.20909461915203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance gap between natural language and formal proof generation, (2) the discrepancy between final-answer accuracy and full-proof validity, and (3) the impact of best-of-n selection on proof quality. Finally, to showcase the utility of the OPC, we finetune an 8B-parameter model on the dataset, obtaining a model that performs on par with the best model, Gemini-2.5-Pro, on the task of evaluating proof correctness.
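As a concrete illustration of question (3), the sketch below shows best-of-n selection in miniature: sample n candidate proofs, score each with a judge model (for instance, one finetuned on the OPC, as in the paper's final experiment), and keep the highest-scoring candidate. The helpers `generate_proof` and `judge_correctness` are hypothetical stand-ins, not the paper's actual interface.

```python
# Minimal best-of-n selection sketch. `generate_proof` and
# `judge_correctness` are hypothetical stand-ins for a prover model and a
# proof-correctness judge; the OPC paper's real pipeline may differ.
from typing import Callable

def best_of_n(problem: str,
              generate_proof: Callable[[str], str],
              judge_correctness: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate proofs and return the one the judge rates highest."""
    candidates = [generate_proof(problem) for _ in range(n)]
    # Each score is the judge's estimated probability that the proof is valid.
    scored = [(judge_correctness(problem, proof), proof) for proof in candidates]
    return max(scored, key=lambda pair: pair[0])[1]
```

Under this setup, selection quality is bounded by the judge: how well it evaluates correctness (the task on which the finetuned 8B model matches Gemini-2.5-Pro) determines how much of the best-of-n headroom is realized.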
Related papers
- Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields.
This makes it a demanding frontier for large language models (LLMs).
We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
- Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach [14.213719696233934]
Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning.
We propose a neuro-symbolic approach that combines LLMs' generative strengths with structured components to overcome this challenge.
arXiv Detail & Related papers (2025-05-20T15:13:32Z)
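One common instantiation of such a neuro-symbolic loop, sketched under assumed interfaces (`propose_step` and `apply_step` are hypothetical, and the paper's actual architecture may differ): the LLM proposes each proof step, and a symbolic checker validates it before it is accepted.

```python
# Illustrative neuro-symbolic loop: a neural proposer suggests steps and a
# symbolic checker validates them. `propose_step` and `apply_step` are
# hypothetical stand-ins, not the paper's actual components.
from typing import Callable, List, Optional

def prove(goal: str,
          propose_step: Callable[[str], str],
          apply_step: Callable[[str, str], Optional[str]],
          max_steps: int = 50) -> Optional[List[str]]:
    """apply_step returns the new goal state if the step is valid
    ('' once the goal is closed) and None if the checker rejects it."""
    state, proof = goal, []
    for _ in range(max_steps):
        step = propose_step(state)           # neural: suggest the next step
        new_state = apply_step(state, step)  # symbolic: validate and apply
        if new_state is None:
            continue                         # rejected step: sample again
        proof.append(step)
        if new_state == "":
            return proof                     # all goals closed
        state = new_state
    return None
```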
- LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts progress in a proof.
Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z)
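To make the intended use concrete, here is a minimal best-first search in which a LeanProgress-style predictor scores proof states; `predict_progress`, `expand`, and `is_proved` are hypothetical stand-ins for the trained model and the tactic engine.

```python
# Best-first proof search ranked by predicted progress. All three callables
# are hypothetical stand-ins; the paper's search integration may differ.
import heapq
from typing import Callable, Iterable, List, Optional, Tuple

def best_first_search(root: str,
                      predict_progress: Callable[[str], float],
                      expand: Callable[[str], Iterable[str]],
                      is_proved: Callable[[str], bool],
                      budget: int = 1000) -> Optional[str]:
    # heapq is a min-heap, so scores are negated to pop the most
    # promising (highest predicted progress) state first.
    frontier: List[Tuple[float, str]] = [(-predict_progress(root), root)]
    for _ in range(budget):
        if not frontier:
            return None
        _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for child in expand(state):
            heapq.heappush(frontier, (-predict_progress(child), child))
    return None
```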
- Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs [6.65877320351217]
We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs.
We generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude.
Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks.
arXiv Detail & Related papers (2025-02-16T06:20:39Z)
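The core idea can be pictured as a graph walk, sketched below under assumed interfaces (`applicable_tactics` and `apply_tactic` are hypothetical stand-ins for the Lean interaction layer): traverse the state transition graph breadth-first and record each reachable state together with the tactic path that produced it, yielding (theorem, proof) candidates.

```python
# Breadth-first walk over a tactic state-transition graph; every visited
# state plus its tactic path is a candidate (theorem, proof) pair. The
# callables are hypothetical stand-ins for the Lean interaction layer.
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

def harvest(initial_states: Iterable[str],
            applicable_tactics: Callable[[str], Iterable[str]],
            apply_tactic: Callable[[str, str], Optional[str]],
            limit: int = 100_000) -> List[Tuple[str, List[str]]]:
    seen: set = set()
    out: List[Tuple[str, List[str]]] = []
    queue = deque((s, []) for s in initial_states)
    while queue and len(out) < limit:
        state, path = queue.popleft()
        if state in seen:
            continue
        seen.add(state)
        out.append((state, path))  # record the state and how it was reached
        for tac in applicable_tactics(state):
            nxt = apply_tactic(state, tac)  # None if the tactic fails
            if nxt is not None and nxt not in seen:
                queue.append((nxt, path + [tac]))
    return out
```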
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
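The interleaving pattern can be sketched briefly; `sample` and `run_tactic` are hypothetical wrappers around a language model and a Lean tactic executor, and the prompt format is illustrative rather than the paper's.

```python
# Interleaved thought/tactic generation in the Lean-STaR spirit: emit an
# informal thought for the current goal, then a tactic conditioned on both.
# `sample` and `run_tactic` are hypothetical stand-ins.
from typing import Callable, List, Tuple

def think_then_prove(goal: str,
                     sample: Callable[[str], str],
                     run_tactic: Callable[[str, str], str],
                     max_steps: int = 30) -> List[Tuple[str, str]]:
    trace: List[Tuple[str, str]] = []
    state = goal
    for _ in range(max_steps):
        thought = sample(f"Goal:\n{state}\nThought:")  # informal reasoning first
        tactic = sample(f"Goal:\n{state}\nThought: {thought}\nTactic:")
        trace.append((thought, tactic))
        state = run_tactic(state, tactic)  # apply the tactic to get the next goal
        if state == "":                    # empty state: no goals remain
            break
    return trace
```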
- DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z)
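The generate-verify-filter pattern behind such synthetic datasets can be sketched as follows; `autoformalize`, `sample_proof`, and `lean_check` are hypothetical stand-ins, and the actual DeepSeek-Prover pipeline includes further filtering and iteration.

```python
# Verify-and-filter construction of synthetic proof data: autoformalize each
# competition problem, sample candidate proofs, and keep only pairs accepted
# by the Lean checker. All three callables are hypothetical stand-ins.
from typing import Callable, Iterable, List, Tuple

def build_dataset(problems: Iterable[str],
                  autoformalize: Callable[[str], str],
                  sample_proof: Callable[[str], str],
                  lean_check: Callable[[str, str], bool],
                  attempts: int = 16) -> List[Tuple[str, str]]:
    dataset: List[Tuple[str, str]] = []
    for problem in problems:
        statement = autoformalize(problem)    # informal text -> Lean 4 statement
        for _ in range(attempts):
            proof = sample_proof(statement)
            if lean_check(statement, proof):  # keep only verifier-approved proofs
                dataset.append((statement, proof))
                break
    return dataset
```

Because the verifier filters every pair, scale substitutes for human labeling: invalid generations cost compute but never enter the training set.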
- Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code [0.0]
The Coq proof assistant stands out for its rigorous approach to verifying mathematical assertions and software correctness.
Despite advances in artificial intelligence and machine learning, the specialized nature of Coq syntax and semantics poses unique challenges for large language models (LLMs).
The dataset presented here, derived from a collection of over 10,000 Coq source files, encompasses a wide array of propositions, proofs, and definitions.
arXiv Detail & Related papers (2024-03-19T10:53:40Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present MUSTARDSAUCE, a theorem-and-proof benchmark with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean playground consisting of toolkits, data, models, and benchmarks.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
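Premise retrieval of this kind is typically embedding-based; the sketch below ranks library premises by cosine similarity to the current goal and feeds the top-k to the generator. `embed` and `generate_tactic` are hypothetical stand-ins, not LeanDojo's actual API.

```python
# Retrieval-augmented premise selection sketch: rank premises by cosine
# similarity to the goal embedding, then prompt the tactic generator with
# the top-k. `embed` and `generate_tactic` are hypothetical stand-ins.
import math
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)  # epsilon guards against zero vectors

def retrieve_and_prove(goal: str,
                       premises: List[str],
                       embed: Callable[[str], List[float]],
                       generate_tactic: Callable[[str], str],
                       k: int = 10) -> str:
    goal_vec = embed(goal)
    ranked = sorted(premises, key=lambda p: cosine(embed(p), goal_vec), reverse=True)
    context = "\n".join(ranked[:k])  # top-k premises as extra prompt context
    return generate_tactic(f"Premises:\n{context}\nGoal:\n{goal}\nTactic:")
```

In practice the premise embeddings would be precomputed and indexed rather than recomputed per goal; the sketch trades efficiency for brevity.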
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
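A predicted proof graph in this setting is only meaningful if it satisfies structural constraints, e.g. every edge joins predicted nodes and every node is reachable from a root (the question or a fact). The check below illustrates that invariant post-hoc; it is a simplification, since PRover enforces comparable constraints during inference rather than after it, and the interface is hypothetical.

```python
# Post-hoc validity check for a predicted proof graph: every edge must join
# predicted nodes, and every node must be reachable from a root. This is an
# illustrative simplification of the structural constraints PRover enforces.
from typing import Iterable, List, Set, Tuple

def valid_proof_graph(nodes: Set[str],
                      edges: Iterable[Tuple[str, str]],
                      roots: Set[str]) -> bool:
    edge_list = list(edges)
    if not all(u in nodes and v in nodes for u, v in edge_list):
        return False  # an edge touches a node the model did not predict
    # Flood-fill from the roots to test connectivity (edges treated as undirected).
    reachable = set(roots & nodes)
    frontier: List[str] = list(reachable)
    while frontier:
        cur = frontier.pop()
        for u, v in edge_list:
            if cur == u and v not in reachable:
                reachable.add(v)
                frontier.append(v)
            elif cur == v and u not in reachable:
                reachable.add(u)
                frontier.append(u)
    return reachable == nodes
```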
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.