Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- URL: http://arxiv.org/abs/2306.02408v1
- Date: Sun, 4 Jun 2023 17:02:59 GMT
- Title: Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- Authors: Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin
Wang, Ji-Rong Wen
- Abstract summary: Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach that deliberates over the reasoning steps with tool interfaces, namely DELI.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
- Score: 75.74103236299477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought prompting (CoT) and tool augmentation have been
validated in recent work as effective practices for improving large language
models (LLMs) at step-by-step reasoning on complex math-related tasks.
However, most existing math reasoning datasets may not be able to fully
evaluate and analyze the ability of LLMs to manipulate tools and perform
reasoning, as they may require only a few tool invocations or lack
annotations for evaluating intermediate reasoning steps. To address this
issue, we construct CARP, a new Chinese dataset consisting of 4,886
computation-intensive algebra problems with formulated annotations on
intermediate steps. On CARP, we test four LLMs with CoT prompting and find
that they are all prone to making mistakes in the early steps of the
solution, leading to wrong answers. Based on this finding, we propose DELI, a
new approach that deliberates over the reasoning steps with tool interfaces.
In DELI, we first initialize a step-by-step solution based on retrieved
exemplars, then iterate two deliberation procedures that check and refine the
intermediate steps of the generated solution, from the perspectives of tool
manipulation and natural language reasoning, until the solutions converge or
a maximum number of turns is reached. Experimental results on CARP and six
other datasets show that the proposed DELI mostly outperforms competitive
baselines and can further boost the performance of existing CoT methods. Our
data and code are available at https://github.com/RUCAIBox/CARP.
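To make the deliberation loop concrete, here is a minimal sketch of the procedure the abstract describes. All helper names are hypothetical stand-ins for the retrieval module, the LLM, and the tool interface; this is not the authors' released implementation.

```python
# Minimal sketch of the DELI loop (hypothetical helpers, not the
# authors' code): initialize from retrieved exemplars, then alternate
# tool-based and natural-language deliberation until the solution
# stops changing or the turn budget runs out.

MAX_TURNS = 5  # the "maximum number of turns" from the abstract

def retrieve_exemplars(problem: str) -> list[str]:
    return []  # stand-in: nearest annotated problems (e.g., from CARP)

def generate_solution(problem: str, exemplars: list[str]) -> list[str]:
    return ["step 1: ...", "step 2: ..."]  # stand-in: LLM CoT generation

def deliberate_with_tools(problem: str, steps: list[str]) -> list[str]:
    # Stand-in: re-derive each step through tool calls and replace steps
    # whose computed result disagrees with the written one.
    return steps

def deliberate_with_nl(problem: str, steps: list[str]) -> list[str]:
    # Stand-in: prompt the LLM to critique and rewrite dubious steps.
    return steps

def deli(problem: str) -> list[str]:
    steps = generate_solution(problem, retrieve_exemplars(problem))
    for _ in range(MAX_TURNS):
        revised = deliberate_with_nl(
            problem, deliberate_with_tools(problem, steps))
        if revised == steps:  # converged: neither procedure changed anything
            return revised
        steps = revised
    return steps  # turn budget exhausted
```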
Related papers
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion.
We train a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
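As a rough illustration of the fill-in-the-middle recipe, the sketch below turns a solution chain into (prefix, suffix) -> middle training pairs; it is an assumption about the general data construction, not the NuminaMath-FIM pipeline itself.

```python
# Hedged sketch of fill-in-the-middle (FIM) pair construction from a
# solution chain: hide each interior step and ask the model to
# reconstruct it from the surrounding steps.

def make_fim_examples(problem: str, steps: list[str]) -> list[dict]:
    examples = []
    for i in range(1, len(steps) - 1):
        examples.append({
            "input": problem
                     + "\nPrefix: " + " ".join(steps[:i])
                     + "\nSuffix: " + " ".join(steps[i + 1:]),
            "target": steps[i],  # the hidden middle step to fill in
        })
    return examples

pairs = make_fim_examples(
    "Solve 2x + 3 = 11.",
    ["Subtract 3 from both sides: 2x = 8.",
     "Divide both sides by 2: x = 4.",
     "So the answer is 4."],
)
print(pairs[0]["target"])  # -> "Divide both sides by 2: x = 4."
```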
arXiv Detail & Related papers (2025-02-17T11:22:24Z)
- BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning [83.03531832811386]
BoostStep is a method that enhances reasoning accuracy through step-aligned ICL examples.
It integrates seamlessly with chain-of-thought (CoT) and tree search algorithms.
It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
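A hedged sketch of the step-aligned ICL idea: retrieve the single most similar step from a step-level example bank while generating each step, rather than retrieving whole worked problems up front. The retrieval, toy similarity, and prompt format below are illustrative assumptions, not the BoostStep implementation.

```python
# Illustrative step-aligned in-context learning with a toy
# token-overlap similarity; all names are assumptions.

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))  # Jaccard overlap

def retrieve_step(draft_step: str, bank: list[str]) -> str:
    # Pick the bank step most similar to the model's draft of this step.
    return max(bank, key=lambda ex: similarity(draft_step, ex))

def guided_step_prompt(problem: str, history: str,
                       draft_step: str, bank: list[str]) -> str:
    # The retrieved example step is prepended before regenerating the
    # current step; a real system would send this prompt to the LLM.
    guide = retrieve_step(draft_step, bank)
    return f"Example step: {guide}\nProblem: {problem}\nSo far: {history}\n"
```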
arXiv Detail & Related papers (2025-01-06T18:59:13Z)
- Enhancing Mathematical Reasoning in LLMs with Background Operators [36.14500963096528]
We develop a Prolog solution that includes problem-specific predicates and intermediate predicates derived from background operators.
For efficient data augmentation, we apply K-fold cross-validated self-training.
Our experimental results demonstrate that 5-fold cross-validated self-training effectively identifies new, accurate Prolog solutions.
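The K-fold cross-validated self-training loop can be sketched as follows. The trainer, solver, and Prolog execution are stubbed out, so this only shows the fold-and-verify bookkeeping, not the paper's pipeline.

```python
# Sketch of K-fold cross-validated self-training: label each fold with
# a model trained on the other folds, and keep only candidate Prolog
# solutions whose executed answer matches the reference answer.
# fit/execute are stand-ins, not a real trainer or Prolog interpreter.

import random

class _StubModel:
    def solve(self, question: str) -> str:
        return "ans(0)."          # stand-in candidate Prolog program

def fit(train: list[dict]) -> _StubModel:
    return _StubModel()           # stand-in for fine-tuning

def execute(program: str):
    return 0                      # stand-in for running the program

def kfold_self_train(problems: list[dict], k: int = 5, seed: int = 0):
    idx = list(range(len(problems)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accepted = []
    for i in range(k):
        train = [problems[j] for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit(train)
        for j in folds[i]:                       # held-out fold
            cand = model.solve(problems[j]["q"])
            if execute(cand) == problems[j]["answer"]:  # answer-verified
                accepted.append({"q": problems[j]["q"], "solution": cand})
    return accepted
```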
arXiv Detail & Related papers (2024-12-05T12:24:54Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
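The paper trains a verifier over both solution formats; as a much-simplified stand-in, the sketch below scores candidates by cross-format agreement, counting an answer's support across sampled CoT answers and executed PoT outputs.

```python
# Simplified CoT/PoT cross-verification by agreement voting; a toy
# substitute for the paper's trained verifier, shown only to convey
# why combining the two formats helps catch errors.

from collections import Counter

def cross_verify(cot_answers: list[str], pot_outputs: list[str]) -> str:
    # Majority vote over both channels: agreement between a written
    # answer and an executed program counts as extra evidence.
    votes = Counter(cot_answers) + Counter(pot_outputs)
    return votes.most_common(1)[0][0]

print(cross_verify(["4", "4", "5"], ["4", "4"]))  # -> "4"
```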
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains.
They still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics.
We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
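A loose sketch of tree search with a backward check: expand partial solutions best-first and accept a finished solution only if a separate back-verification (e.g., substituting the answer back into the problem) passes. All expansion, scoring, and verification functions are stubs, not the BEATS components.

```python
# Best-first tree search with back-verification (illustrative stubs).

import heapq
import itertools

def is_complete(steps): return bool(steps) and steps[-1].startswith("answer")
def back_verify(problem, steps): return True     # stand-in backward check
def propose_steps(problem, steps):               # stand-in LLM expansion
    return ["answer: 4"] if steps else ["step: simplify"]
def score(steps): return len(steps)              # stand-in step scorer

def tree_search(problem: str, max_nodes: int = 100):
    counter = itertools.count()             # tie-breaker for the heap
    frontier = [(0.0, next(counter), [])]   # (negative score, id, steps)
    seen = 0
    while frontier and seen < max_nodes:
        _, _, steps = heapq.heappop(frontier)
        seen += 1
        if is_complete(steps):
            if back_verify(problem, steps):  # accept only verified leaves
                return steps
            continue                          # reject and keep searching
        for step in propose_steps(problem, steps):
            child = steps + [step]
            heapq.heappush(frontier, (-score(child), next(counter), child))
    return None
```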
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
- Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation [24.272384832200522]
We propose mistakE-Driven key reasonIng step distillaTion (EDIT).
We design prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions.
Experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets.
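A hedged illustration of the dual-CoT idea: given a correct and an incorrect chain with similar paths but divergent conclusions, the first position where they differ marks the step that decided the outcome. The paper aligns chains by minimum edit distance; the common-prefix version below is a simplification.

```python
# Locate the key reasoning step as the first divergence between dual
# CoTs (simplified common-prefix alignment, not the paper's
# minimum-edit-distance procedure).

def first_divergence(correct: list[str], wrong: list[str]) -> int:
    """Index of the first differing step (length of the common prefix)."""
    i = 0
    while i < min(len(correct), len(wrong)) and correct[i] == wrong[i]:
        i += 1
    return i

good = ["2x = 8", "x = 4", "answer: 4"]
bad  = ["2x = 8", "x = 6", "answer: 6"]
print(first_divergence(good, bad))  # -> 1: 'x = 4' vs 'x = 6' is the key step
```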
arXiv Detail & Related papers (2024-05-30T06:32:11Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
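To illustrate the framing of reasoning as search, here is a simplified beam search over reasoning steps scored by a reward function. MindStar's actual variants are best-first and Levin tree search guided by a process reward model; the expand/reward functions below are stand-ins.

```python
# Simplified inference-time search over reasoning paths (illustrative
# stubs, not the MindStar algorithms).

def expand(problem: str, steps: list[str]) -> list[str]:
    # Stand-in for sampling candidate next steps from an LLM.
    return [f"step{len(steps) + 1}a", f"step{len(steps) + 1}b"]

def reward(steps: list[str]) -> float:
    # Stand-in for a process reward model scoring a partial path.
    return -float(len("".join(steps)))

def search(problem: str, beam_width: int = 3, depth: int = 4) -> list[str]:
    beams = [[]]                                   # partial reasoning paths
    for _ in range(depth):
        candidates = [b + [s] for b in beams for s in expand(problem, b)]
        candidates.sort(key=reward, reverse=True)  # keep highest-reward paths
        beams = candidates[:beam_width]
    return beams[0]
```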
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision [12.023661884821554]
We introduce an innovative two-stage framework that adeptly transfers mathematical expertise from large to tiny language models.
Our method fully leverages semantic understanding capabilities when searching for 'problem-equation' pairs.
It demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods.
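One way to picture the weakly supervised 'problem-equation' search: with only the final answer as supervision, keep any candidate equation whose evaluated result matches it. The brute-force enumeration below is purely illustrative; the paper searches over model-proposed candidates.

```python
# Toy weakly supervised 'problem-equation' pair search: accept a
# candidate equation iff evaluating it reproduces the reference answer.

import itertools
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def search_equation(nums: list[float], answer: float):
    for (a, b), (sym, op) in itertools.product(
            itertools.permutations(nums, 2), OPS.items()):
        try:
            if abs(op(a, b) - answer) < 1e-9:
                return f"{a} {sym} {b}"   # answer-verified candidate
        except ZeroDivisionError:
            continue
    return None

print(search_equation([3.0, 12.0], 4.0))  # -> "12.0 / 3.0"
```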
arXiv Detail & Related papers (2024-03-21T13:29:54Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
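The declarative pattern is easy to demonstrate with an off-the-shelf solver. In the sketch below, the constraint specification is written by hand where SatLM would have the LLM generate it; it assumes the z3-solver Python package.

```python
# Declarative reasoning via an off-the-shelf solver: state constraints,
# let the solver derive the answer (hand-written spec standing in for
# an LLM-generated one). Requires: pip install z3-solver

from z3 import Int, Solver, sat

# "Alice has twice as many apples as Bob; together they have 12."
alice, bob = Int("alice"), Int("bob")
s = Solver()
s.add(alice == 2 * bob, alice + bob == 12, bob >= 0)

if s.check() == sat:
    m = s.model()
    print(m[alice], m[bob])   # -> 8 4
```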
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.