Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
- URL: http://arxiv.org/abs/2505.04528v1
- Date: Wed, 07 May 2025 16:02:14 GMT
- Title: Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
- Authors: Qi Liu, Xinhao Zheng, Renqiu Xia, Xingzhi Qi, Qinxiang Cao, Junchi Yan
- Abstract summary: We present a principled formulation of problem-solving as a deterministic Markov decision process. We also present a novel framework, FPS, which utilizes existing FTP environments to perform process-verified problem-solving. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark, and MiniF2F-Solving and PutnamBench-Solving, adaptations of the FTP benchmarks MiniF2F and PutnamBench.
- Score: 44.5694120006447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing, yet it remains underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), which decouples solving from answer verification for better human alignment. The expressiveness, soundness, and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark, and MiniF2F-Solving and PutnamBench-Solving, adaptations of the FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach that determines the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines; they solve at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.
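To make the MDP formulation concrete, the sketch below writes problem-solving in standard MDP notation. The specific instantiation (a four-tuple with a deterministic transition function and a binary terminal reward) is an illustrative assumption in textbook notation, not the paper's exact definition.

```latex
% Hypothetical sketch: problem-solving as a deterministic MDP.
% The tuple components and the binary reward are illustrative
% assumptions, not the paper's exact construction.
\[
\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, r), \qquad
T : \mathcal{S} \times \mathcal{A} \to \mathcal{S} \quad \text{(deterministic)},
\]
\[
r(s) =
\begin{cases}
1 & \text{if } s \text{ is a terminal state whose answer is formally verified,}\\
0 & \text{otherwise.}
\end{cases}
\]
```

Under this reading, a state would be the current formal solving context and an action a verifiable solution step, so every trajectory is process-verified by construction.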
Related papers
- Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving [48.22540519786074]
Recent studies show that informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench. We propose a novel framework that decouples high-level reasoning from low-level proof generation. We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success.
arXiv Detail & Related papers (2025-07-07T22:38:49Z)
- Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving [3.2233767737586674]
Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems. This paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic (a sketch of such a loop follows this entry). We evaluate our framework on the widely used mathematical benchmarks MATH500 and ProcessBench.
arXiv Detail & Related papers (2025-05-27T08:21:07Z)
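The Formalizer/Critic pairing suggests a generate-then-check loop. Below is a minimal, hypothetical Python sketch of such a loop; the helper names (`formalize`, `critique`) and the retry strategy are assumptions for illustration, not MATH-VF's actual interface.

```python
# Hypothetical Formalizer/Critic loop, loosely following the MATH-VF
# description above. All names and signatures are illustrative
# assumptions, not the paper's API; the two helpers are left as stubs.

def formalize(informal_step: str) -> str:
    """Translate an informal solution step into a formal statement (stub)."""
    raise NotImplementedError

def critique(formal_statement: str) -> tuple[bool, str]:
    """Verify a formal statement; return (ok, feedback) (stub)."""
    raise NotImplementedError

def verify_solution(steps: list[str], max_rounds: int = 3) -> bool:
    """Formalize each step and let the critic drive revisions."""
    for step in steps:
        candidate = formalize(step)
        for _ in range(max_rounds):
            ok, feedback = critique(candidate)
            if ok:
                break
            # Re-formalize with the critic's feedback appended (stub strategy).
            candidate = formalize(step + "\nCritic feedback: " + feedback)
        else:
            return False  # the step never passed the critic
    return True
```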
- Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions [37.10426226729792]
We introduce the Enumerate-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating pattern-driven conjecturing with formal theorem proving (sketched below). We present ConstructiveBench, a dataset of 3,431 answer-construction problems from various math competitions with verified Lean formalizations.
arXiv Detail & Related papers (2025-05-24T03:52:25Z)
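The enumerate-conjecture-prove pattern is easy to illustrate on a toy problem: enumerate small instances, conjecture a closed form, then check it. The sketch below is a generic Python illustration, not the ECP codebase; in ECP the final stage would attempt a formal Lean proof rather than a numeric stress test.

```python
# Generic enumerate -> conjecture -> check pattern, shown on a toy
# answer-construction problem: a closed form for 1 + 3 + ... + (2n - 1).
# Illustrative only; ECP's 'prove' stage targets formal Lean proofs.

def enumerate_cases(n_max: int) -> list[tuple[int, int]]:
    """Brute-force the answer on small instances."""
    return [(n, sum(2 * k - 1 for k in range(1, n + 1)))
            for n in range(1, n_max + 1)]

def conjecture(cases: list[tuple[int, int]]):
    """Guess a closed form from the data (here: test n -> n**2)."""
    if all(ans == n * n for n, ans in cases):
        return lambda n: n * n
    return None

def check(closed_form, n_test: int = 1000) -> bool:
    """Stand-in for the proving stage: stress-test the conjecture."""
    return all(closed_form(n) == sum(2 * k - 1 for k in range(1, n + 1))
               for n in range(1, n_test + 1))

closed_form = conjecture(enumerate_cases(20))
assert closed_form is not None and check(closed_form)
print("conjecture n^2 survives the check stage")
```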
- CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics [45.32204834230928]
We introduce CombiBench, a comprehensive benchmark comprising 100 problems, each formalized in Lean 4 and paired with its corresponding informal statement. CombiBench is suitable for testing problems from 2000 onward (except IMO 2004 P3, as its statement contains an image). We also provide a comprehensive and standardized evaluation framework, dubbed Fine-Eval, for formal mathematics.
arXiv Detail & Related papers (2025-05-06T04:32:17Z)
- FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models [17.919212265668783]
We present FormalMATH, a large-scale Lean 4 benchmark comprising 5,560 formally verified problems spanning high-school Olympiad challenges to undergraduate-level theorems. We introduce a novel human-in-the-loop autoformalization pipeline that integrates statement autoformalization, semantic verification, and negation-based disproof filtering (sketched below). Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations.
arXiv Detail & Related papers (2025-05-05T15:37:00Z)
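Negation-based disproof filtering can be read as: attempt to prove the negation of each candidate formalization, and discard any candidate whose negation is provable, since it cannot faithfully formalize a true statement. The Python sketch below is a hypothetical illustration with assumed helper names, not FormalMATH's pipeline code.

```python
# Hypothetical sketch of negation-based disproof filtering. `try_prove`
# stands in for a call to an automated prover; its name and signature
# are assumptions for illustration.

def try_prove(statement: str, timeout_s: float = 30.0) -> bool:
    """Return True iff a prover closes the goal within the budget (stub)."""
    raise NotImplementedError

def negate(statement: str) -> str:
    """Form the logical negation of a formal statement."""
    return f"¬ ({statement})"

def filter_formalizations(candidates: list[str]) -> list[str]:
    """Keep only candidates whose negation is not provable."""
    kept = []
    for stmt in candidates:
        if try_prove(negate(stmt)):
            continue  # negation provable => statement false; discard it
        kept.append(stmt)
    return kept
```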
- DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition [24.50773495427783]
We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4. The model achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems that enriches our evaluation, including 15 problems selected from recent AIME competitions.
arXiv Detail & Related papers (2025-04-30T16:57:48Z)
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
- Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation (sketched below). For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite it or be applied to it. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z)
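As described, the mutation loop is: for each theorem in the library, find the theorems that can rewrite it or be applied to it, and emit each mutated statement as a new theorem. Below is a minimal, hypothetical Python rendering of that loop; the `Theorem` type and both helpers are assumptions, since the real system manipulates Lean/Mathlib terms.

```python
# Hypothetical sketch of symbolic-mutation data synthesis in the style
# described above. The data model and helpers are illustrative stubs;
# Alchemy itself operates on Lean/Mathlib terms.

from dataclasses import dataclass

@dataclass(frozen=True)
class Theorem:
    name: str
    statement: str

def invocable_theorems(target: Theorem, library: list[Theorem]) -> list[Theorem]:
    """Find theorems usable to rewrite `target` or apply to it (stub)."""
    raise NotImplementedError

def mutate(target: Theorem, tool: Theorem) -> Theorem:
    """Produce a new theorem by rewriting `target` with `tool` (stub)."""
    raise NotImplementedError

def synthesize(library: list[Theorem]) -> list[Theorem]:
    """One mutation pass over the whole library."""
    return [mutate(target, tool)
            for target in library
            for tool in invocable_theorems(target, library)]
```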
- VC Search: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning [46.25056744404318]
We develop a benchmark called Problems with Missing and Contradictory conditions (PMC), containing over 5,000 validated ill-defined mathematical problems. VCSEARCH improves the accuracy of identifying unsolvable problems by at least 12% across different large language models.
arXiv Detail & Related papers (2024-06-07T16:24:12Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets with task instructions and solutions in the form of Python programs (an illustrative example follows this entry).
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
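A program-form solution of the kind LILA collects can be pictured as a short Python program whose output is the answer to the task instruction. The example below is invented for illustration; it is not an actual LILA entry.

```python
# Invented example of a program-form solution (not from LILA).
# Task instruction: "A train travels 300 km in 4 hours.
# What is its average speed in km/h?"

def solution() -> float:
    distance_km = 300
    time_h = 4
    return distance_km / time_h  # average speed = distance / time

print(solution())  # 75.0
```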
- Formally Verified Solution Methods for Infinite-Horizon Markov Decision Processes [7.538482310185133]
We build on existing formalizations of probability theory to analyze the expected total reward criterion on infinite-horizon problems.
Our developments formalize the Bellman equation (recalled below) and give conditions under which optimal policies exist.
We show that our system can compete with and even outperform state-of-the-art systems.
arXiv Detail & Related papers (2022-06-05T13:03:34Z)
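For reference, the Bellman optimality equation for the expected total reward criterion, in standard textbook notation (not quoted from the paper):

```latex
% Bellman optimality equation, expected total reward criterion,
% in standard MDP notation (textbook form, not quoted from the paper).
\[
v^*(s) \;=\; \max_{a \in \mathcal{A}(s)} \Big( r(s, a)
  \;+\; \sum_{s' \in \mathcal{S}} p(s' \mid s, a)\, v^*(s') \Big).
\]
```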
- Logically Consistent Loss for Visual Question Answering [66.83963844316561]
Current neural-network-based Visual Question Answering (VQA) models cannot ensure logical consistency across related questions due to the independent and identically distributed (i.i.d.) assumption.
We propose a new model-agnostic logic constraint to tackle this issue by formulating a logically consistent loss in the multi-task learning framework (a schematic form is given below).
Experiments confirm that the proposed loss formulation and the introduction of hybrid batches lead to more consistency as well as better performance.
arXiv Detail & Related papers (2020-11-19T20:31:05Z)
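Schematically, a logically consistent loss augments the usual task loss with a penalty on prediction pairs that violate a known logical relation between questions. The form below is a generic illustration in assumed notation, not the paper's exact loss.

```latex
% Generic schematic of a logically consistent loss: task loss plus a
% penalty for violating logical relations between paired questions.
% Notation is assumed for illustration, not the paper's exact loss.
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{VQA}}
  \;+\; \lambda \sum_{(q_i,\, q_j) \in \mathcal{R}}
  \phi\!\left(\hat{y}_i, \hat{y}_j\right),
\]
% where R pairs logically related questions and phi >= 0 measures how
% strongly the joint prediction violates the implied relation.
```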