StepProof: Step-by-step verification of natural language mathematical proofs
- URL: http://arxiv.org/abs/2506.10558v2
- Date: Mon, 30 Jun 2025 10:59:44 GMT
- Title: StepProof: Step-by-step verification of natural language mathematical proofs
- Authors: Xiaolin Hu, Qinghua Zhou, Bogdan Grechuk, Ivan Y. Tyukin
- Abstract summary: We propose StepProof, a novel autoformalization method for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. We show that StepProof significantly improves proof success rates and efficiency compared to traditional methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interactive theorem provers (ITPs) are powerful tools for the formal verification of mathematical proofs down to the axiom level. However, their lack of a natural language interface remains a significant limitation. Recent advancements in large language models (LLMs) have enhanced the understanding of natural language inputs, paving the way for autoformalization - the process of translating natural language proofs into formal proofs that can be verified. Despite these advancements, existing autoformalization approaches are limited to verifying complete proofs and lack the capability for finer, sentence-level verification. To address this gap, we propose StepProof, a novel autoformalization method designed for granular, step-by-step verification. StepProof breaks down complete proofs into multiple verifiable subproofs, enabling sentence-level verification. Experimental results demonstrate that StepProof significantly improves proof success rates and efficiency compared to traditional methods. Additionally, we found that minor manual adjustments to the natural language proofs, tailoring them for step-level verification, further enhanced StepProof's performance in autoformalization.
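The sentence-level decomposition described in the abstract can be illustrated with a small Lean 4 sketch (ours, not taken from the paper): each sentence of the informal proof becomes an independently checkable `have` step, so a verification failure localizes to a single sentence rather than to the proof as a whole.

```lean
-- Illustrative sketch only: each `have` corresponds to one sentence
-- of an informal proof, and each is verified independently.
example (n : Nat) : n + 0 = 0 + n := by
  -- Sentence 1: "n + 0 equals n."
  have h1 : n + 0 = n := Nat.add_zero n
  -- Sentence 2: "0 + n equals n."
  have h2 : 0 + n = n := Nat.zero_add n
  -- Sentence 3: "Combining the two equalities gives the claim."
  rw [h1, h2]
```

If, say, the second sentence of the informal proof were wrong, only `h2` would fail to check, pinpointing the faulty step.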
Related papers
- Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification [56.218970738892764]
Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). Hallucinations in CoT are notoriously difficult to detect, and current mitigation methods operate as opaque boxes that do not provide checkable evidence for their judgments, possibly limiting their effectiveness. We propose a retrospective, step-aware formal verification framework, $Safe$. Rather than assigning arbitrary scores, we articulate mathematical claims in the formal language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
arXiv Detail & Related papers (2025-06-05T03:16:08Z)
- Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework [2.9334627971166336]
This paper investigates the logical reasoning capabilities of large language models (LLMs). A trained LLM receives as input a set of assumptions and a goal, and produces as output a proof that formally derives the goal from the assumptions. A critical obstacle for training is the scarcity of real-world proofs.
arXiv Detail & Related papers (2025-04-28T19:25:29Z)
- LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z)
- Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation [27.60611509339328]
We argue that the optimal order for one training data sample occurs when the relevant intermediate supervision for a particular proof step is always positioned to the left of that proof step.
We demonstrate that training is most effective when the proof is in the intuitively sequential order.
arXiv Detail & Related papers (2024-10-30T18:00:04Z)
- Automated Proof Generation for Rust Code via Self-Evolution [69.25795662658356]
We introduce SAFE, a framework that overcomes the lack of human-written snippets to enable automated proof generation for Rust code. SAFE re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models. We achieve a 52.52% accuracy rate on a benchmark crafted by human experts, a significant leap over GPT-4o's performance of 14.39%.
arXiv Detail & Related papers (2024-10-21T08:15:45Z)
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
- Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language. We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet. We also release new benchmarks: a new research-level mathematics dataset, RLM25, a corrected ProofNet, and ProofNetVerif, with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z)
- Proving Theorems Recursively [80.42431358105482]
We propose POETRY, which proves theorems in a level-by-level manner.
Unlike previous step-by-step methods, POETRY searches for a sketch of the proof at each level.
We observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
arXiv Detail & Related papers (2024-05-23T10:35:08Z)
- Verifying Quantum Phase Estimation (QPE) using Prove-It [0.0]
The general-purpose interactive theorem-proving assistant called Prove-It was used to verify the Quantum Phase Estimation (QPE) algorithm.
Prove-It is unique in its ability to express sophisticated mathematical statements, including statements about quantum circuits.
arXiv Detail & Related papers (2023-04-05T01:21:00Z)
- Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present a novel stepwise method, NLProofS (Natural Language Proof Search).
NLProofS learns to generate relevant steps conditioning on the hypothesis.
It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)
- Natural Language Specifications in Proof Assistants [0.0]
This paper argues that it is possible to build support for natural language specifications within existing proof assistants.
arXiv Detail & Related papers (2022-05-16T17:05:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.