Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning
- URL: http://arxiv.org/abs/2410.19817v2
- Date: Mon, 17 Feb 2025 06:39:16 GMT
- Title: Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning
- Authors: Lang Cao, Chao Peng, Renhong Chen, Wu Ning, Yingtian Zou, Yitong Li
- Abstract summary: Step Guided Reasoning is more stable and generalizable than few-shot methods.
We demonstrate the significant effect of Step Guided Reasoning in augmenting mathematical performance in state-of-the-art language models.
- Score: 7.965282234763401
- License:
- Abstract: Mathematical reasoning has been challenging for large language models (LLMs). However, the introduction of step-by-step Chain-of-Thought (CoT) inference has significantly advanced the mathematical capabilities of LLMs. Despite this progress, current approaches either necessitate extensive inference datasets for training or depend on few-shot methods that frequently compromise computational accuracy. To address these bottlenecks in mathematical reasoning, we propose a novel method called Step Guided Reasoning, which is more stable and generalizable than few-shot methods and does not involve further fine-tuning of the model. In this approach, LLMs reflect on small reasoning steps, similar to how humans deliberate and focus attention on what to do next. By incorporating this reflective process into the inference stage, LLMs can effectively guide their reasoning from one step to the next. Through extensive experiments, we demonstrate the significant effect of Step Guided Reasoning in augmenting mathematical performance in state-of-the-art language models. Qwen2-72B-Instruct outperforms its math-specific counterpart, Qwen2.5-72B-Math-Instruct, on MMLU-STEM with a score of 90.9%, compared to 87.3%. The average scores of Qwen2-7B-Instruct and Qwen2-72B-Instruct increase from 27.1% to 36.3% and from 36.5% to 47.4% on the mathematics domain, respectively.
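The core idea, alternating between generating guidance about the next small step and then carrying out only that step, can be illustrated with a minimal inference-loop sketch. This is not the authors' released code: the prompt wording, the stopping heuristic, and the `generate` wrapper around an LLM endpoint are all assumptions made for illustration.

```python
from typing import Callable

# Minimal sketch of a step-guided inference loop (not the paper's exact implementation).
# `generate` is a hypothetical wrapper around whatever LLM endpoint you use.
def step_guided_reasoning(question: str,
                          generate: Callable[[str], str],
                          max_steps: int = 10) -> str:
    """Alternate between generating guidance (what to do next) and
    executing only that small step, until a final answer appears."""
    trace = f"Problem: {question}\n"
    for _ in range(max_steps):
        # 1) Guidance generation: ask the model to reflect on the next small step.
        guidance = generate(
            trace + "\nBriefly state the single next step to take (do not solve it yet)."
        )
        # 2) Step reasoning: carry out only that step, conditioned on the guidance.
        step = generate(
            trace + f"\nGuidance: {guidance}\nCarry out this step and show the work."
        )
        trace += f"\nGuidance: {guidance}\nStep: {step}"
        if "final answer" in step.lower():  # crude stopping heuristic
            break
    # Ask for the answer once the stepwise trace is complete.
    return generate(trace + "\nState the final answer only.")
```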
Related papers
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion.
We develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
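For intuition only, a fill-in-the-middle style step expansion might look like the sketch below; the tag format, the `generate` helper, and the way NuminaMath-FIM is actually formatted are assumptions, not details taken from the paper.

```python
from typing import Callable

def build_fim_prompt(steps: list[str], gap_index: int) -> str:
    """Ask a model to fill in a more detailed intermediate step between
    steps[gap_index - 1] and steps[gap_index]."""
    prefix = "\n".join(steps[:gap_index])
    suffix = "\n".join(steps[gap_index:])
    return (
        "Fill in the missing intermediate reasoning step.\n"
        f"<prefix>\n{prefix}\n</prefix>\n"
        f"<suffix>\n{suffix}\n</suffix>\n"
        "<middle>"
    )

def expand_chain(steps: list[str], gap_index: int,
                 generate: Callable[[str], str]) -> list[str]:
    """Splice the generated middle step back into the solution chain."""
    middle = generate(build_fim_prompt(steps, gap_index)).strip()
    return steps[:gap_index] + [middle] + steps[gap_index:]
```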
arXiv Detail & Related papers (2025-02-17T11:22:24Z) - Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.
Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z) - BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning [83.03531832811386]
BoostStep is a method that enhances reasoning accuracy through step-aligned ICL examples.
It integrates seamlessly with chain-of-thought (CoT) and tree search algorithms.
It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
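As a rough illustration of step-aligned in-context example selection, the sketch below retrieves the stored example step most similar to the step about to be taken and prepends it to the prompt; the word-overlap similarity and prompt layout are placeholders, not BoostStep's actual retriever.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard word overlap; a stand-in for whatever retriever is really used."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def pick_step_example(current_step: str, example_steps: list[str]) -> str:
    """Pick the worked example step (e.g. mined from MATH solutions) that most
    resembles the step the model is about to take."""
    return max(example_steps, key=lambda ex: similarity(current_step, ex))

def step_aligned_prompt(problem: str, partial_solution: str,
                        current_step: str, example_steps: list[str]) -> str:
    """Prepend the retrieved step-level example before asking for the next step."""
    example = pick_step_example(current_step, example_steps)
    return (
        f"Worked example of a similar step:\n{example}\n\n"
        f"Problem: {problem}\n{partial_solution}\n"
        "Continue with the next step only."
    )
```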
arXiv Detail & Related papers (2025-01-06T18:59:13Z) - System-2 Mathematical Reasoning via Enriched Instruction Tuning [13.672967091915181]
Enriched Instruction Tuning (EIT) is a method that enriches existing human-annotated mathematical datasets by synergizing human and AI feedback.
EIT achieves an accuracy of 84.1% on GSM8K and 32.5% on MATH, surpassing state-of-the-art fine-tuning and prompting methods.
arXiv Detail & Related papers (2024-12-22T10:49:27Z) - Teaching-Inspired Integrated Prompting Framework: A Novel Approach for Enhancing Reasoning in Large Language Models [8.370453544530914]
Large Language Models (LLMs) exhibit impressive performance across various domains but still struggle with arithmetic reasoning tasks.
Recent work shows the effectiveness of prompt design methods in enhancing reasoning capabilities.
We propose a novel and effective Teaching-Inspired Integrated Framework, which emulates the instructional process of a teacher guiding students.
arXiv Detail & Related papers (2024-10-10T16:02:36Z) - Logic Contrastive Reasoning with Lightweight Large Language Model for Math Word Problems [0.0]
This study focuses on improving the performance of lightweight Large Language Models (LLMs) in mathematical reasoning tasks.
We introduce a novel method for measuring mathematical logic similarity and design an automatic screening mechanism.
By employing carefully crafted positive and negative example prompts, we guide the model towards adopting sound reasoning logic.
arXiv Detail & Related papers (2024-08-29T08:26:42Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results.
We develop a method that avoids introducing external resources, relying instead on perturbations to the input.
Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
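A toy version of this masking perturbation is sketched below; the real method operates on the model tokenizer's tokens with its own mask ratio and schedule, so the whitespace tokenization and default ratio here are illustrative assumptions.

```python
import random

def mask_cot(cot: str, mask_ratio: float = 0.15,
             mask_token: str = "<mask>", seed: int | None = None) -> str:
    """Randomly replace a fraction of chain-of-thought tokens with a mask token."""
    rng = random.Random(seed)
    tokens = cot.split()  # crude whitespace tokenization for illustration
    return " ".join(mask_token if rng.random() < mask_ratio else tok
                    for tok in tokens)

# Example: perturb only the reasoning trace; question and answer stay intact.
sample = {
    "question": "What is 12 * 7 + 5?",
    "cot": "12 * 7 = 84 and 84 + 5 = 89",
    "answer": "89",
}
sample["cot"] = mask_cot(sample["cot"], mask_ratio=0.3, seed=0)
```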
arXiv Detail & Related papers (2024-03-04T16:21:54Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed issue is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)