Solving Formal Math Problems by Decomposition and Iterative Reflection
- URL: http://arxiv.org/abs/2507.15225v1
- Date: Mon, 21 Jul 2025 03:56:35 GMT
- Title: Solving Formal Math Problems by Decomposition and Iterative Reflection
- Authors: Yichi Zhou, Jianqiu Zhao, Yongxin Zhang, Bohan Wang, Siran Wang, Luoxin Chen, Jiahui Wang, Haowei Chen, Allan Jie, Xinbo Zhang, Haocheng Wang, Luong Trung, Rong Ye, Phan Nhat Hoang, Huishuai Zhang, Peng Sun, Hang Li,
- Abstract summary: textbfDelta Prover orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment.<n>bftextDelta Prover achieves a state-of-the-art 95.9% success rate on the miniF2F-test benchmark.
- Score: 30.54275542622631
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose Large Language Models (LLMs) have achieved remarkable success in intelligence, performing comparably to human experts on complex reasoning tasks such as coding and mathematical reasoning. However, generating formal proofs in specialized languages like Lean 4 remains a significant challenge for these models, limiting their application in complex theorem proving and automated verification. Current approaches typically require specializing models through fine-tuning on dedicated formal corpora, incurring high costs for data collection and training. In this work, we introduce \textbf{Delta Prover}, an agent-based framework that orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment. Delta Prover leverages the reflection and reasoning capabilities of general-purpose LLMs to interactively construct formal proofs in Lean 4, circumventing the need for model specialization. At its core, the agent integrates two novel, interdependent components: an algorithmic framework for reflective decomposition and iterative proof repair, and a custom Domain-Specific Language (DSL) built upon Lean 4 for streamlined subproblem management. \textbf{Delta Prover achieves a state-of-the-art 95.9\% success rate on the miniF2F-test benchmark, surpassing all existing approaches, including those requiring model specialization.} Furthermore, Delta Prover exhibits a significantly stronger test-time scaling law compared to standard Best-of-N proof strategies. Crucially, our findings demonstrate that general-purpose LLMs, when guided by an effective agentic structure, possess substantial untapped theorem-proving capabilities. This presents a computationally efficient alternative to specialized models for robust automated reasoning in formal environments.
Related papers
- Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks [48.105258051884384]
This paper proposes a new two-stage training framework that enhances models' self-correction capabilities.<n>During the first stage, a multi-turn dialogue strategy guides the model to generate long chain-of-thought (CoT) data.<n>The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize data distribution.
arXiv Detail & Related papers (2026-01-09T08:19:11Z) - BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs.<n>We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z) - Large Language Model enabled Mathematical Modeling [2.132096006921049]
This research investigates the potential of Large Language Models (LLMs) to bridge the formulation gap using natural language understanding and code generation.<n>DeepSeek-R1 is a cost-efficient and high-performing model trained with reinforcement learning.<n>Our methodology includes baseline assessments, the development of a hallucination taxonomy, and the application of mitigation strategies.
arXiv Detail & Related papers (2025-10-22T17:41:42Z) - Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics [1.2978846076301875]
Ax-Prover is a multi-agent system for automated theorem proving in Lean.<n>It can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts.<n>We benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory.
arXiv Detail & Related papers (2025-10-14T17:57:04Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory [44.64175433092553]
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving.<n>Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored.<n>We introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning.
arXiv Detail & Related papers (2025-09-26T14:40:14Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.<n>GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution.<n>We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [68.00108157244952]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.<n>A promising yet largely uncharted alternative is formal language-based reasoning.<n>Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - ProofCompass: Enhancing Specialized Provers with LLM Guidance [6.757964026033364]
This paper introduces Proof, a novel hybrid methodology that achieves remarkable computational efficiency.<n>It strategically guides existing specialized prover methods, such as DeepSeek-Prover-v1.5-RL (DSP-v1.5) with a Large Language Model (LLM) without requiring additional model training.<n>On the miniF2F benchmark, Proof demonstrates substantial resource efficiency: it outperforms DSP-v1.5 ($54.9% rightarrow 55.3%$) while using 25 fewer attempts ($3200 rightarrow 128$)
arXiv Detail & Related papers (2025-07-18T19:28:01Z) - ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs [54.154593699263074]
ProtoReasoning is a framework that enhances the reasoning ability of Large Reasoning Models.<n>ProtoReasoning transforms problems into corresponding prototype representations.<n>ProtoReasoning achieves 4.7% improvement over baseline models on logical reasoning.
arXiv Detail & Related papers (2025-06-18T07:44:09Z) - Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning [80.27561080938747]
We propose a systematic framework, CANOE, to improve the faithfulness of large language models (LLMs) in both short-form and long-form generation tasks without human annotations.<n>Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data.<n> Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs.
arXiv Detail & Related papers (2025-05-22T10:10:07Z) - General-Reasoner: Advancing LLM Reasoning Across All Domains [64.70599911897595]
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs)<n>We propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains.<n>We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc.
arXiv Detail & Related papers (2025-05-20T17:41:33Z) - FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models [17.919212265668783]
We present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems.<n>We introduce a novel human-in-the-loop autoformalization pipeline that integrates statement autoformalization, semantic verification, and negation-based disproof filtering strategies.<n>Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations.
arXiv Detail & Related papers (2025-05-05T15:37:00Z) - MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving [30.112351299773632]
We propose a comprehensive framework for Lean4 theorem proving to solve this issue.<n>It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction.<n>Our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset.
arXiv Detail & Related papers (2025-03-05T05:50:31Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.<n>The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.<n>We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z) - Large Language Model Agent as a Mechanical Designer [7.136205674624813]
We propose a framework that leverages a pretrained Large Language Model (LLM) in conjunction with an FEM module to autonomously generate, evaluate, and refine structural designs.<n>LLM operates without domain-specific fine-tuning, using general reasoning to propose design candidates, interpret FEM-derived performance metrics, and apply structurally sound modifications.<n>Compared to Non- Sorting Genetic Algorithm II (NSGA-II), our method achieves faster convergence and fewer FEM evaluations.
arXiv Detail & Related papers (2024-04-26T16:41:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.