Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving
- URL: http://arxiv.org/abs/2505.20869v1
- Date: Tue, 27 May 2025 08:21:07 GMT
- Title: Step-Wise Formal Verification for LLM-Based Mathematical Problem Solving
- Authors: Kuo Zhou, Lu Zhang
- Abstract summary: Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems. This paper proposes a framework, MATH-VF, which includes a Formalizer and a Critic. We evaluate our framework on widely used mathematical benchmarks: MATH500 and ProcessBench.
- Score: 3.2233767737586674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated formidable capabilities in solving mathematical problems, yet they may still make errors in logical reasoning and computation during the problem-solving process. This paper therefore proposes MATH-VF, a framework comprising a Formalizer and a Critic, for formally verifying the correctness of solutions generated by LLMs. The framework first uses the Formalizer, which employs an LLM to translate a natural language solution into a formal context. The Critic (which integrates external tools such as a Computer Algebra System and an SMT solver) then evaluates the correctness of each statement within the formal context and, when a statement is incorrect, provides corrective feedback. We empirically investigate the effectiveness of MATH-VF in two scenarios: 1) Verification: MATH-VF is used to determine the correctness of a solution to a given problem. 2) Refinement: when MATH-VF identifies errors in a solution produced by an LLM-based solution generator, it passes the Critic's corrective suggestions back to the generator, which regenerates the solution. We evaluate our framework on the widely used mathematical benchmarks MATH500 and ProcessBench, demonstrating the superiority of our approach over existing approaches.
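To make the verify-and-refine loop concrete, here is a minimal sketch, assuming hypothetical llm_solve and llm_formalize stand-ins for the paper's solution generator and Formalizer. Only the CAS-backed Critic check (via SymPy) is concrete; the paper's actual formal context, SMT-solver integration, and feedback format are richer than this.

```python
# Minimal sketch of the MATH-VF loop; llm_solve / llm_formalize are
# hypothetical placeholders for LLM calls, not the paper's components.
import sympy

def critic_check(statement: str) -> tuple[bool, str]:
    """Check one formal statement of the form 'lhs = rhs' with a CAS."""
    lhs_text, rhs_text = statement.split("=", 1)
    diff = sympy.simplify(sympy.sympify(lhs_text) - sympy.sympify(rhs_text))
    if diff == 0:
        return True, "ok"
    return False, f"'{statement.strip()}' fails: difference is {diff}"

def verify(formal_statements: list[str]) -> list[str]:
    """Verification scenario: collect corrective feedback for wrong steps."""
    return [msg for ok, msg in map(critic_check, formal_statements) if not ok]

def refine(problem, llm_solve, llm_formalize, max_rounds=3):
    """Refinement scenario: loop Critic feedback back into the generator."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        solution = llm_solve(problem, feedback)      # regenerate with hints
        feedback = verify(llm_formalize(solution))   # formalize, then check
        if not feedback:                             # every step verified
            return solution
    return solution

print(verify(["(x + 1)**2 = x**2 + 2*x + 1",   # correct expansion
              "(x + 1)**2 = x**2 + 1"]))        # computational error
```

In the refinement scenario, the Critic's messages play the role of the corrective suggestions that are handed back to the solution generator.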
Related papers
- EHOP: A Dataset of Everyday NP-Hard Optimization Problems [66.41749917354159]
Everyday Hard Optimization Problems (EHOP) is a collection of NP-hard optimization problems expressed in natural language. EHOP includes problem formulations that could be found in computer science textbooks, versions that are dressed up as problems that could arise in real life, and variants of well-known problems with inverted rules. We find that state-of-the-art LLMs, across multiple prompting strategies, systematically solve textbook problems more accurately than their real-life and inverted counterparts.
arXiv Detail & Related papers (2025-02-19T14:39:59Z) - Learning by Analogy: Enhancing Few-Shot Prompting for Math Word Problem Solving with Computational Graph-Based Retrieval [22.865124583257987]
We show how analogies drawn from similarly structured questions can improve large language models' problem-solving capabilities.
Specifically, we rely on the retrieval of problems with similar computational graphs to the given question to serve as exemplars in the prompt.
Empirical results across six math word problem datasets demonstrate the effectiveness of our proposed method.
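As a rough illustration of the retrieval step, the sketch below reduces each problem's computational graph to a set of operator edges and ranks corpus problems by Jaccard overlap with the query. Both the graph encoding and the similarity measure are simplifications assumed for this example, not the paper's exact construction.

```python
# Illustrative computational-graph-based exemplar retrieval for few-shot
# prompting; the edge-set encoding and Jaccard ranking are assumptions.
def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity between two edge sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_exemplars(query_graph: set, corpus: list, k: int = 2) -> list:
    """corpus holds (problem, solution, edge_set) triples; return top-k."""
    ranked = sorted(corpus, key=lambda ex: jaccard(query_graph, ex[2]),
                    reverse=True)
    return ranked[:k]

def build_prompt(question: str, exemplars: list) -> str:
    """Assemble retrieved exemplars into a few-shot prompt."""
    shots = "\n\n".join(f"Q: {p}\nA: {s}" for p, s, _ in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

corpus = [
    ("3 bags of 4 apples each; total?", "3 * 4 = 12", {("mul", "total")}),
    ("5 apples, eat 2; left?", "5 - 2 = 3", {("sub", "left")}),
]
print(build_prompt("6 boxes of 7 pens each; total?",
                   retrieve_exemplars({("mul", "total")}, corpus, k=1)))
```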
arXiv Detail & Related papers (2024-11-25T15:01:25Z) - Autoformulation of Mathematical Optimization Models Using LLMs [50.030647274271516]
We develop an automated approach to creating optimization models from natural language descriptions for commercial solvers.
We identify the three core challenges of autoformulation: (1) defining the vast, problem-dependent hypothesis space, (2) efficiently searching this space under uncertainty, and (3) evaluating formulation correctness.
arXiv Detail & Related papers (2024-11-03T20:41:38Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even the best-performing model, GPT-4o, still trails human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
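One way to picture the CoT/PoT collaboration is the toy cross-check below: a Chain-of-Thought answer earns extra weight only when executing a Program-of-Thought candidate reproduces it. The sampled answers and programs are taken as given here, and the weighting scheme is invented for illustration; the paper's verification is richer.

```python
# Toy cross-check of CoT answers against executed PoT programs; the +2
# weighting for executable evidence is an assumption for this sketch.
from collections import Counter

def run_pot(program: str) -> str:
    """Execute a PoT program that stores its result in `answer`."""
    scope: dict = {}
    exec(program, {"__builtins__": {}}, scope)  # demo only: sandbox properly
    return str(scope.get("answer"))

def collaborative_verify(cot_answers: list[str],
                         pot_programs: list[str]) -> str:
    """Vote over CoT answers, upweighting those confirmed by a PoT run."""
    votes = Counter(cot_answers)
    for program in pot_programs:
        votes[run_pot(program)] += 2  # extra weight for executable evidence
    return votes.most_common(1)[0][0]

print(collaborative_verify(
    ["14", "12", "14"],
    ["answer = 2 * 7", "answer = 7 + 7"],
))
```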
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
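The search idea can be sketched as reward-guided beam search over partial reasoning paths, in the spirit of such inference-time methods. The expand (step proposer) and score (reward) functions below are hypothetical placeholders, not MindStar's actual components.

```python
# Schematic reward-guided beam search over partial reasoning paths.
import heapq

def beam_search(question, expand, score, beam_width=3, max_depth=5):
    """Keep the highest-scoring partial reasoning paths at each depth."""
    beam = [(score(question, []), [])]           # (reward, path of steps)
    for _ in range(max_depth):
        candidates = []
        for _, path in beam:
            for step in expand(question, path):  # propose next steps
                new_path = path + [step]
                candidates.append((score(question, new_path), new_path))
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])[1]      # best path found

# Toy usage: steps are digits, reward favors paths summing to 7.
toy_expand = lambda q, path: ["1", "2", "3"]
toy_score = lambda q, path: -abs(7 - sum(map(int, path))) if path else -7
print(beam_search("toy", toy_expand, toy_score, beam_width=2, max_depth=4))
```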
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR uses DPO to train a verifier that judges the correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
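The data side of such verifier training can be sketched as turning sampled solutions into preference pairs, with correct solutions preferred over incorrect ones. The prompt/chosen/rejected field names follow common DPO data conventions and are an assumption, not the paper's exact schema; the training call itself is omitted.

```python
# Building DPO preference pairs from labeled solution samples (sketch).
from itertools import product

def make_dpo_pairs(problem: str, solutions: list[tuple[str, bool]]) -> list:
    """solutions: (solution_text, is_correct) pairs for one problem."""
    correct = [s for s, ok in solutions if ok]
    incorrect = [s for s, ok in solutions if not ok]
    return [
        {"prompt": problem, "chosen": good, "rejected": bad}
        for good, bad in product(correct, incorrect)
    ]

pairs = make_dpo_pairs(
    "What is 3 * 4 + 2?",
    [("3*4=12, 12+2=14. Answer: 14", True),
     ("3*4=12, 12+2=15. Answer: 15", False)],
)
print(pairs[0]["rejected"])
```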
arXiv Detail & Related papers (2024-02-09T15:02:56Z) - Language Models for Business Optimisation with a Real World Case Study in Production Scheduling [3.224702011999591]
Large Language Models (LLMs) have demonstrated outstanding performance across different language-related tasks. We present an LLM-based framework for automating problem formulation in business optimisation.
arXiv Detail & Related papers (2023-09-22T23:45:21Z) - Explaining Competitive-Level Programming Solutions using LLMs [3.560501183771493]
We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining solutions.
Our explanation generation methodology produces a structured explanation of a problem's solution, containing descriptions and analysis.
arXiv Detail & Related papers (2023-07-11T15:26:49Z) - Highlighting Named Entities in Input for Auto-Formulation of Optimization Problems [0.0]
This paper presents an approach that converts linear programming word problems into mathematical formulations.
We leverage the named entities in the input and augment the input to highlight these entities.
Our approach achieves the highest accuracy among all submissions to the NL4Opt Competition, securing first place in the generation track.
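The augmentation idea can be illustrated as below: entity spans detected in an LP word problem are wrapped in marker tags before the formulation model sees the input. The tagging scheme and entity labels are invented for this example; the actual system uses a trained NER component.

```python
# Toy input augmentation: wrap known entity spans in bracketed type tags.
import re

def highlight_entities(text: str, entities: dict[str, str]) -> str:
    """Wrap each known entity span in a bracketed type tag."""
    for span, etype in entities.items():
        text = re.sub(re.escape(span), f"[{etype}]{span}[/{etype}]", text)
    return text

problem = "A factory makes chairs and tables with at most 40 hours of labor."
tags = {"chairs": "VAR", "tables": "VAR", "40 hours": "LIMIT"}
print(highlight_entities(problem, tags))
```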
arXiv Detail & Related papers (2022-12-26T16:13:57Z)