Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
- URL: http://arxiv.org/abs/2506.04592v1
- Date: Thu, 05 Jun 2025 03:16:08 GMT
- Title: Safe: Enhancing Mathematical Reasoning in Large Language Models via Retrospective Step-aware Formal Verification
- Authors: Chengwu Liu, Ye Yuan, Yichun Yin, Yan Xu, Xin Xu, Zaoyu Chen, Yasheng Wang, Lifeng Shang, Qun Liu, Ming Zhang
- Abstract summary: Chain-of-Thought prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). Hallucinations in CoT are notoriously difficult to detect, and current mitigation methods operate as opaque boxes that do not provide checkable evidence for their judgments, possibly limiting their effectiveness. We propose a retrospective, step-aware formal verification framework, $Safe$. Rather than assigning arbitrary scores, we strive to articulate the mathematical claims in the formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations.
- Score: 56.218970738892764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) prompting has become the de facto method to elicit reasoning capabilities from large language models (LLMs). However, to mitigate hallucinations in CoT that are notoriously difficult to detect, current methods such as process reward models (PRMs) or self-consistency operate as opaque boxes and do not provide checkable evidence for their judgments, possibly limiting their effectiveness. To address this issue, we draw inspiration from the idea that "the gold standard for supporting a mathematical claim is to provide a proof". We propose a retrospective, step-aware formal verification framework $Safe$. Rather than assigning arbitrary scores, we strive to articulate mathematical claims in formal mathematical language Lean 4 at each reasoning step and provide formal proofs to identify hallucinations. We evaluate our framework $Safe$ across multiple language models and various mathematical datasets, demonstrating a significant performance improvement while offering interpretable and verifiable evidence. We also propose $FormalStep$ as a benchmark for step correctness theorem proving with $30,809$ formal statements. To the best of our knowledge, our work represents the first endeavor to utilize formal mathematical language Lean 4 for verifying natural language content generated by LLMs, aligning with the reason why formal mathematical languages were created in the first place: to provide a robust foundation for hallucination-prone human-written proofs.
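To make the core idea concrete, here is a minimal illustrative sketch (not taken from the paper; the claim, theorem name, and proof are hypothetical) of how a single CoT step's claim might be stated and formally proved in Lean 4 with Mathlib:

```lean
import Mathlib

-- Hypothetical step-level claim: a CoT step asserts
-- "a and b are even, so a + b is even".
-- A Safe-style verifier would render the claim as a Lean 4 statement and
-- search for a proof; a successful proof certifies the step, while a
-- failure flags a potential hallucination.
theorem step_claim_even_sum (a b : ℤ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨x, hx⟩ := ha  -- hx : a = x + x (Mathlib's definition of Even)
  obtain ⟨y, hy⟩ := hb  -- hy : b = y + y
  exact ⟨x + y, by rw [hx, hy]; ring⟩
```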
Related papers
- Seed-Prover: Deep and Broad Reasoning for Automated Theorem Proving [36.20164235042574]
In this work, we propose Seed-Prover, a lemma-style whole-proof reasoning model. To solve IMO-level contest problems, we design three test-time inference strategies that enable both deep and broad reasoning. Seed-Prover proves 78.1% of formalized past IMO problems, saturates MiniF2F, and achieves over 50% on PutnamBench, outperforming the previous state of the art by a large margin.
arXiv Detail & Related papers (2025-07-31T17:00:30Z) - Can Large Language Models Learn Formal Logic? A Data-Driven Training and Evaluation Framework [2.9334627971166336]
This paper investigates the logical reasoning capabilities of large language models (LLMs). A trained LLM receives as input a set of assumptions and a goal, and produces as output a proof that formally derives the goal from the assumptions. A critical obstacle for training is the scarcity of real-world proofs.
arXiv Detail & Related papers (2025-04-28T19:25:29Z) - LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts progress in a proof. Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z) - Mathematical Formalized Problem Solving and Theorem Proving in Different Fields in Lean 4 [0.0]
This paper explores the use of Large Language Models (LLMs) to generate formal proof steps and complete formalized proofs.
The goal is to determine how AI can be leveraged to assist the mathematical formalization process and improve its performance.
arXiv Detail & Related papers (2024-09-09T18:21:28Z) - Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment; a hypothetical sketch of this interleaving appears after this list.
arXiv Detail & Related papers (2024-07-14T01:43:07Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Towards a Mathematics Formalisation Assistant using Large Language Models [5.485439959027125]
We explore the abilities of a large language model (Codex) to help with formalisation in the Lean theorem prover.
Codex is able to formalise short mathematical statements at undergrad level with nearly 75% accuracy on 120 theorem statements.
We show that, with a new prompting strategy, Codex can formalise proofs written in natural language, with at least one out of twelve Codex completions being easy to repair into a complete proof; a hypothetical formalisation example also appears after this list.
arXiv Detail & Related papers (2022-11-14T16:52:32Z) - NaturalProver: Grounded Mathematical Proof Generation with Language Models [84.2064569475095]
Theorem proving in natural mathematical language plays a central role in mathematical advances and education.
We develop NaturalProver, a language model that generates proofs by conditioning on background references.
NaturalProver is capable of proving some theorems that require short (2-6 step) proofs, and providing next-step suggestions that are rated as correct and useful over 40% of the time.
arXiv Detail & Related papers (2022-05-25T17:01:18Z) - Generating Natural Language Proofs with Verifier-Guided Search [74.9614610172561]
We present a novel stepwise method, NLProofS (Natural Language Proof Search).
NLProofS learns to generate relevant steps conditioning on the hypothesis.
It achieves state-of-the-art performance on EntailmentBank and RuleTaker.
arXiv Detail & Related papers (2022-05-25T02:22:30Z)
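The Lean-STaR entry above describes producing informal thoughts before each proof step. Below is a minimal hypothetical sketch of what such an interleaved trace could look like, with a comment standing in for the generated thought (the theorem is illustrative, not taken from that paper):

```lean
import Mathlib

theorem interleaved_example (a b : ℕ) : a + b = b + a := by
  -- Thought: addition on the natural numbers is commutative, and the
  -- library already provides this fact as Nat.add_comm, so applying it
  -- should close the goal in one step.
  exact Nat.add_comm a b
```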
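The Codex entry above concerns translating undergrad-level statements into Lean. A hypothetical input/output pair is sketched below; the statement choice and theorem name are illustrative rather than drawn from that paper, and the proof simply defers to an existing Mathlib lemma:

```lean
import Mathlib

-- Input (natural language): "The square root of two is irrational."
-- One plausible Lean 4 formalisation of that statement:
theorem sqrt_two_is_irrational : Irrational (Real.sqrt 2) :=
  irrational_sqrt_two  -- Mathlib already proves this fact
```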
This list is automatically generated from the titles and abstracts of the papers in this site.