Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
- URL: http://arxiv.org/abs/2410.12934v1
- Date: Wed, 16 Oct 2024 18:18:42 GMT
- Title: Enhancing Mathematical Reasoning in LLMs by Stepwise Correction
- Authors: Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang
- Abstract summary: Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scored one as the final answer to mathematical reasoning problems.
We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths.
The verify-then-revise process not only improves answer correctness but also reduces token consumption, since fewer reasoning paths need to be generated.
- Score: 39.67266805233599
- Abstract: Best-of-N decoding methods instruct large language models (LLMs) to generate multiple solutions, score each using a scoring function, and select the highest-scored one as the final answer to a mathematical reasoning problem. However, because this process generates candidates repeatedly and independently, they often share the same mistakes, so the selected solution may still be incorrect. We propose a novel prompting method named Stepwise Correction (StepCo) that helps LLMs identify and revise incorrect steps in their generated reasoning paths. It iterates verification and revision phases that employ a process-supervised verifier. The verify-then-revise process not only improves answer correctness but also reduces token consumption, since fewer reasoning paths need to be generated. With StepCo, a series of LLMs demonstrate exceptional performance. Notably, using GPT-4o as the backend LLM, StepCo achieves an average accuracy of 94.1 across eight datasets, significantly outperforming the state-of-the-art Best-of-N method by +2.4 while reducing token consumption by 77.8%.
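The abstract outlines an iterative verify-then-revise loop but gives no implementation details. Below is a minimal Python sketch of that loop as described: the helper callables (generate_path, verify_step, revise_from) are hypothetical stand-ins for the LLM and the process-supervised verifier, and the threshold/iteration logic is an illustrative assumption, not the authors' actual method.

```python
# Hypothetical sketch of a StepCo-style verify-then-revise loop.
# generate_path, verify_step, and revise_from are assumed stand-ins
# for LLM calls and the process-supervised verifier.
from typing import Callable, List

def stepco(
    problem: str,
    generate_path: Callable[[str], List[str]],           # LLM: problem -> reasoning steps
    verify_step: Callable[[str, List[str]], float],      # verifier: score a step given its prefix
    revise_from: Callable[[str, List[str]], List[str]],  # LLM: regenerate from a verified prefix
    threshold: float = 0.5,
    max_iters: int = 5,
) -> List[str]:
    """Verify each step of a single reasoning path and revise from the
    first step the verifier flags as likely incorrect."""
    path = generate_path(problem)
    for _ in range(max_iters):
        # Earliest step whose verifier score falls below the threshold.
        bad = next(
            (i for i in range(len(path))
             if verify_step(problem, path[: i + 1]) < threshold),
            None,
        )
        if bad is None:  # every step passed verification
            return path
        # Keep the verified prefix; regenerate the rest of the path.
        path = path[:bad] + revise_from(problem, path[:bad])
    return path
```

Because revision reuses the verified prefix of a single path rather than sampling whole new solutions, this style of loop spends tokens only where the verifier finds a fault, which matches the abstract's claimed token savings over Best-of-N.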
Related papers
- BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning [83.03531832811386]
BoostStep is a method that enhances reasoning accuracy through step-aligned ICL examples.
It integrates seamlessly with chain-of-thought (CoT) and tree search algorithms.
It improves DeepSeek-R1-671B's performance on AIME by 2.2%, using only simple examples from the MATH dataset.
arXiv Detail & Related papers (2025-01-06T18:59:13Z) - Planning-Driven Programming: A Large Language Model Programming Workflow [8.827173113748701]
Large language models (LLMs) are strong performers in code generation.
Recent research suggests continuously refining programs against visible tests to improve the code generation accuracy of LLMs.
We propose an LLM programming workflow (LPW) designed to improve both initial code generation and subsequent refinements.
arXiv Detail & Related papers (2024-11-21T08:31:06Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check".
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z) - V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR uses DPO to train a verifier that judges the correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
arXiv Detail & Related papers (2024-02-09T15:02:56Z) - Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning [75.74103236299477]
Chain-of-thought (CoT) prompting and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach, named DELI, that deliberates over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
arXiv Detail & Related papers (2023-06-04T17:02:59Z) - Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces the sample budget by up to 7.9 times with an average accuracy drop of less than 0.1% (a simplified sketch of the idea follows this list).
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
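The Adaptive-Consistency entry above describes dynamically adjusting the number of samples per question instead of using a fixed Self-Consistency budget. Below is a minimal Python sketch of that idea; the simple majority-fraction stopping rule and the sample_answer callable are illustrative assumptions, not the paper's actual criterion or implementation.

```python
# Hypothetical sketch of Adaptive-Consistency-style early stopping:
# draw samples one at a time and stop once the running majority answer
# looks stable. sample_answer is an assumed stand-in for one stochastic
# LLM call that returns a final answer string.
from collections import Counter
from typing import Callable

def adaptive_consistency(
    question: str,
    sample_answer: Callable[[str], str],
    max_samples: int = 40,
    min_samples: int = 3,
    stop_fraction: float = 0.8,
) -> str:
    counts: Counter = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer(question)] += 1
        answer, votes = counts.most_common(1)[0]
        # Stop early once one answer dominates the samples drawn so far.
        if n >= min_samples and votes / n >= stop_fraction:
            return answer
    # Budget exhausted: fall back to the plain majority vote.
    return counts.most_common(1)[0][0]
```

Easy questions converge after a handful of agreeing samples, while hard questions use the full budget, which is how such a scheme can cut the average sample count with only a marginal accuracy drop.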