Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- URL: http://arxiv.org/abs/2306.02408v1
- Date: Sun, 4 Jun 2023 17:02:59 GMT
- Title: Evaluating and Improving Tool-Augmented Computation-Intensive Math
Reasoning
- Authors: Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin
Wang, Ji-Rong Wen
- Abstract summary: Chain-of-thought prompting (CoT) and tool augmentation have been validated as effective practices for improving large language models.
We propose a new approach, DELI, that deliberates over the reasoning steps with tool interfaces.
Experimental results on CARP and six other datasets show that the proposed DELI mostly outperforms competitive baselines.
- Score: 75.74103236299477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought prompting (CoT) and tool augmentation have been
validated in recent work as effective practices for improving large language
models (LLMs) to perform step-by-step reasoning on complex math-related tasks.
However, most existing math reasoning datasets may not be able to fully
evaluate and analyze the ability of LLMs in manipulating tools and performing
reasoning, as they may only require very few invocations of tools or miss
annotations for evaluating intermediate reasoning steps. To address the issue,
we construct CARP, a new Chinese dataset consisting of 4,886
computation-intensive algebra problems with formulated annotations on
intermediate steps. In CARP, we test four LLMs with CoT prompting, and find
that they are all prone to making mistakes at the early steps of the solution,
leading to wrong answers. Based on this finding, we propose DELI, a new
approach that deliberates over the reasoning steps with tool interfaces. In
DELI, we first initialize a step-by-step solution based on retrieved
exemplars, then iterate two deliberation procedures that check and refine the
intermediate steps of the generated solution, from the perspectives of tool
manipulation and natural language reasoning, until the solutions converge or
the maximum number of turns is reached. Experimental results on CARP and six
other datasets show that the proposed DELI mostly outperforms competitive
baselines, and can further boost the performance of existing CoT methods. Our
data and code are available at https://github.com/RUCAIBox/CARP.
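Read as pseudocode, the loop just described looks roughly like the sketch below. This is a minimal illustration of the abstract's description, not the released DELI implementation; all four helper callables are hypothetical stand-ins for the retrieval, generation, and deliberation components.

```python
# Minimal sketch of the iterative deliberation loop described in the abstract.
# Our illustration, not the released CARP/DELI code: every helper callable here
# is a hypothetical stand-in for the corresponding component.

def deli(problem, retrieve_exemplars, generate_solution,
         deliberate_with_tools, deliberate_with_nl, max_turns=5):
    """Initialize a step-by-step solution, then check and refine it."""
    exemplars = retrieve_exemplars(problem)            # retrieval-based initialization
    solution = generate_solution(problem, exemplars)   # initial step-by-step solution

    for _ in range(max_turns):
        # Deliberation 1: re-check intermediate steps via tool manipulation,
        # e.g. re-executing each algebraic step with a symbolic calculator.
        refined = deliberate_with_tools(problem, solution)
        # Deliberation 2: re-check the same steps via natural language reasoning.
        refined = deliberate_with_nl(problem, refined)
        if refined == solution:    # converged: no step changed this turn
            break
        solution = refined
    return solution
```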
Related papers
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means to automatically evaluate whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
arXiv Detail & Related papers (2024-10-24T18:02:37Z)
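The symbolic-program evaluation that ReasonAgain describes can be pictured with a toy example: a static word problem becomes a parameterized program, and the model is re-tested on fresh instantiations. A minimal sketch, ours rather than the paper's code, with `ask_model` as a hypothetical LLM wrapper:

```python
import random

# Hypothetical illustration of the symbolic-program idea: a static item such as
# "Alice has 3 apples and buys 4 more" becomes a program that yields the gold
# answer for any inputs, so the model can be re-asked on random variants.

def gold_program(a: int, b: int) -> int:
    return a + b

def consistency_eval(ask_model, trials: int = 20) -> float:
    """Fraction of random variants answered correctly.

    `ask_model(question) -> int` is a hypothetical LLM wrapper.
    """
    correct = 0
    for _ in range(trials):
        a, b = random.randint(2, 99), random.randint(2, 99)
        q = f"Alice has {a} apples and buys {b} more. How many apples does she have now?"
        if ask_model(q) == gold_program(a, b):
            correct += 1
    return correct / trials
```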
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
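One way to picture the CoT/PoT cross-checking idea above: accept an answer only when the natural-language chain-of-thought and an executed program-of-thought agree. A hedged sketch, our illustration rather than the paper's method:

```python
# Sketch of cross-checking a chain-of-thought (CoT) answer against an executed
# program-of-thought (PoT) answer. `pot_program` is model-generated Python that
# assigns its result to `answer`; executing untrusted code like this is unsafe
# outside a sandbox.

def cross_verify(cot_answer: float, pot_program: str) -> bool:
    scope: dict = {}
    try:
        exec(pot_program, scope)       # run the program-of-thought
    except Exception:
        return False                   # a crashing program verifies nothing
    result = scope.get("answer")
    if not isinstance(result, (int, float)):
        return False
    return abs(result - cot_answer) < 1e-6

# Example: cross_verify(12.0, "answer = 3 * 4") -> True
```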
- BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains.
They still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics.
We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
- To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs).
We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z)
- Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation [24.272384832200522]
We propose mistake-driven key reasoning step distillation (EDIT).
We design prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions.
Experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets.
arXiv Detail & Related papers (2024-05-30T06:32:11Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
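The "reasoning as search" formulation that MindStar describes can be sketched as a best-first search over partial reasoning paths. The `expand` (step proposer) and `score` (reward model) callables below are hypothetical stand-ins, not MindStar's actual components:

```python
import heapq
import itertools

# Toy best-first search over partial reasoning paths, in the spirit of
# "reasoning as search". All callables are hypothetical placeholders.

def best_first_reasoning(problem, expand, score, is_complete, budget=100):
    """Return the highest-scoring complete reasoning path found within budget."""
    tie = itertools.count()                       # break score ties deterministically
    frontier = [(-score(problem, []), next(tie), [])]
    while frontier and budget > 0:
        _, _, path = heapq.heappop(frontier)      # best partial path so far
        if is_complete(problem, path):
            return path
        for step in expand(problem, path):        # candidate next reasoning steps
            new_path = path + [step]
            heapq.heappush(frontier, (-score(problem, new_path), next(tie), new_path))
        budget -= 1
    return None
```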
- From Large to Tiny: Distilling and Refining Mathematical Expertise for Math Word Problems with Weakly Supervision [12.023661884821554]
We introduce an innovative two-stage framework that adeptly transfers mathematical expertise from large to tiny language models.
Our method fully leverages semantic understanding capabilities when searching for 'problem-equation' pairs.
It demonstrates significantly improved performance on the Math23K and Weak12K datasets compared to existing small model methods.
arXiv Detail & Related papers (2024-03-21T13:29:54Z)
- SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SATLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
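To illustrate the declarative paradigm SatLM describes: instead of generating an imperative program that computes the answer, the model emits constraints, and an off-the-shelf solver derives it. A small sketch assuming the `z3-solver` package; the constraints here are hand-written for illustration, whereas in SatLM an LLM generates the specification:

```python
# Sketch of the declarative, solver-aided paradigm using the `z3-solver`
# package (not SatLM's released code). Example problem: "Alice is twice as
# old as Bob, and their ages sum to 36."

from z3 import Int, Solver, sat

alice, bob = Int("alice"), Int("bob")
solver = Solver()
solver.add(alice == 2 * bob)         # declarative facts, as a model might emit them
solver.add(alice + bob == 36)

if solver.check() == sat:
    model = solver.model()
    print(model[alice], model[bob])  # -> 24 12
```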
- Arguments to Key Points Mapping with Prompt-based Learning [0.0]
We propose two approaches to the argument-to-keypoint mapping task.
The first approach is to incorporate prompt engineering for fine-tuning the pre-trained language models.
The second approach utilizes prompt-based learning in PLMs to generate intermediary texts.
arXiv Detail & Related papers (2022-11-28T01:48:29Z)