Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- URL: http://arxiv.org/abs/2306.02408v1
- Date: Sun, 4 Jun 2023 17:02:59 GMT
- Title: Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- Authors: Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin
Wang, Ji-Rong Wen
- Abstract summary: Chain-of-thought prompting(CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach that can deliberate the reasoning steps with tool interfaces, namely textbfDELI.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
- Score: 75.74103236299477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought prompting~(CoT) and tool augmentation have been validated in
recent work as effective practices for improving large language models~(LLMs)
to perform step-by-step reasoning on complex math-related tasks. However, most
existing math reasoning datasets may be not able to fully evaluate and analyze
the ability of LLMs in manipulating tools and performing reasoning, as they may
only require very few invocations of tools or miss annotations for evaluating
intermediate reasoning steps. To address the issue, we construct \textbf{CARP},
a new Chinese dataset consisting of 4,886 computation-intensive algebra
problems with formulated annotations on intermediate steps. In CARP, we test
four LLMs with CoT prompting, and find that they are all prone to make mistakes
at the early steps of the solution, leading to wrong answers. Based on this
finding, we propose a new approach that can deliberate the reasoning steps with
tool interfaces, namely \textbf{DELI}. In DELI, we first initialize a
step-by-step solution based on retrieved exemplars, then iterate two
deliberation procedures that check and refine the intermediate steps of the
generated solution, from the perspectives of tool manipulation and natural
language reasoning, until obtaining converged solutions or reaching the maximum
turn. Experimental results on CARP and six other datasets show that the
proposed DELI mostly outperforms competitive baselines, and can further boost
the performance of existing CoT methods. Our data and code are available in
\url{https://github.com/RUCAIBox/CARP}.
Related papers
- Improve Mathematical Reasoning in Language Models by Automated Process Supervision [22.72856086318912]
We propose a novel Monte Carlo Tree Search (MCTS) algorithm named textitOmegaPRM for the efficient collection of high-quality process supervision data.
We are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM)
We have enhanced the instruction tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark.
arXiv Detail & Related papers (2024-06-05T19:25:40Z) - Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation [24.272384832200522]
We propose mistaktextbfE-textbfDriven key reasontextbfIng step distillatextbfTion (textbfEDIT)
We design prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions.
Experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets.
arXiv Detail & Related papers (2024-05-30T06:32:11Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision [12.023661884821554]
We introduce an innovative two-stage framework that adeptly transfers mathematical Expertise from large to tiny language models.
Our method fully leverages the semantic understanding capabilities during the searching 'problem-equation' pair.
It demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods.
arXiv Detail & Related papers (2024-03-21T13:29:54Z) - Reverse That Number! Decoding Order Matters in Arithmetic Learning [49.5504492920404]
Our work introduces a novel strategy that reevaluates the digit order by prioritizing output from the least significant digit.
Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement of in accuracy while requiring only a third of the tokens typically used during training.
arXiv Detail & Related papers (2024-03-09T09:04:53Z) - Efficient Tool Use with Chain-of-Abstraction Reasoning [65.18096363216574]
Large language models (LLMs) need to ground their reasoning to real-world knowledge.
There remains challenges for fine-tuning LLM agents to invoke tools in multi-step reasoning problems.
We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z) - Guiding Language Model Math Reasoning with Planning Tokens [128.57605860640948]
We introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters.
Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z) - SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs)
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z) - Arguments to Key Points Mapping with Prompt-based Learning [0.0]
We propose two approaches to the argument-to-keypoint mapping task.
The first approach is to incorporate prompt engineering for fine-tuning the pre-trained language models.
The second approach utilizes prompt-based learning in PLMs to generate intermediary texts.
arXiv Detail & Related papers (2022-11-28T01:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.