ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
- URL: http://arxiv.org/abs/2309.17452v4
- Date: Wed, 21 Feb 2024 12:59:22 GMT
- Title: ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
- Authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie
Huang, Nan Duan, Weizhu Chen
- Abstract summary: ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
- Score: 170.7899683843177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have made significant progress in various language
tasks, yet they still struggle with complex mathematics. In this paper, we
propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve
challenging mathematical problems by seamlessly integrating natural language
reasoning with the utilization of external tools (e.g., computation libraries
and symbolic solvers), thereby amalgamating the analytical prowess of language
and the computational efficiency of tools. To train ToRA, we curate interactive
tool-use trajectories on mathematical datasets, apply imitation learning on the
annotations, and propose output space shaping to further refine models'
reasoning behavior. As a result, ToRA models significantly outperform
open-source models on 10 mathematical reasoning datasets across all scales with
13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the
competition-level dataset MATH, surpassing the best open-source model
WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source
model that achieves an accuracy exceeding 50% on MATH, which significantly
outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems
with programs. Additionally, we conduct a comprehensive analysis of the
benefits and remaining challenges of tool interaction for mathematical
reasoning, providing valuable insights for future research.
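The interleaved rationale-program-output format that ToRA trains on can be pictured as a short control loop: the model writes natural-language reasoning, emits a Python block, receives the real execution output, and continues until it commits to a final answer. Below is a minimal sketch of that loop, assuming a hypothetical `generate(prompt, stop)` wrapper around an LLM; the fence delimiters, stop string, and `\boxed{}` answer check are illustrative assumptions rather than ToRA's released inference code.

```python
# Minimal sketch of a tool-integrated reasoning loop in the ToRA style.
# `generate` is a hypothetical stand-in for an LLM call, NOT part of the
# ToRA release; the delimiters and stop string are assumptions.
import contextlib
import io
import re


def generate(prompt: str, stop: str) -> str:
    """Hypothetical LLM call: return a completion truncated at `stop`."""
    raise NotImplementedError("plug in your model here")


def run_python(code: str) -> str:
    """Execute a generated program block and capture its stdout."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # fresh namespace; sandbox this in practice
    except Exception as exc:  # surface errors so the model can revise
        return f"Error: {exc}"
    return buf.getvalue().strip()


def solve(problem: str, max_rounds: int = 4) -> str:
    trajectory = problem + "\n"
    for _ in range(max_rounds):
        # Rationale plus (optionally) a fenced program; stop before the
        # model hallucinates the execution output itself.
        chunk = generate(trajectory, stop="```output")
        trajectory += chunk
        if "\\boxed{" in chunk:  # final answer reached, no more tool calls
            break
        blocks = re.findall(r"```python(.*?)```", chunk, re.DOTALL)
        if not blocks:  # no program emitted this round
            break
        # Feed the real execution output back into the context.
        trajectory += f"```output\n{run_python(blocks[-1])}\n```\n"
    return trajectory
```

In this picture, the imitation-learning stage amounts to fine-tuning on curated trajectories of exactly this shape, and output space shaping further diversifies the valid trajectories the model is exposed to.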
Related papers
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies a complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms were originally designed for single-turn chat tasks.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Multi-tool Integration Application for Math Reasoning Using Large Language Model [1.4582633500696451]
This article proposes a novel multi-tool application framework for mathematical reasoning.
It aims to achieve more comprehensive and accurate mathematical reasoning by utilizing the collaborative effect of large language models (LLMs) and multiple external tools.
arXiv Detail & Related papers (2024-08-22T06:27:10Z)
- Benchmarking Large Language Models for Math Reasoning Tasks [12.91916443702145]
We compare seven state-of-the-art in-context learning algorithms for mathematical problem solving across five widely used mathematical datasets on four powerful foundation models.
Our results indicate that larger foundation models like GPT-4o and LLaMA 3-70B can solve mathematical reasoning tasks largely independently of the concrete prompting strategy.
We open-source our benchmark code to support the integration of additional models in future research.
arXiv Detail & Related papers (2024-08-20T13:34:17Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify the optimal reasoning paths (a hedged sketch of this step-search pattern appears after this list).
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning [2.9104279358536647]
We present MathSensei, a tool-augmented large language model for mathematical reasoning.
We study the complementary benefits of the tools: knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API); a generic dispatch sketch for this pattern also appears after this list.
arXiv Detail & Related papers (2024-02-27T05:50:35Z)
- MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [38.127313175508746]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset.
Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique.
MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address the limitations of conventional reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets with task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
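As referenced in the MindStar entry above, inference-time step search can be pictured as expanding candidate next reasoning steps, scoring partial paths with a reward model, and keeping only the best few. The sketch below assumes hypothetical `propose_steps` and `score` placeholders for an LLM sampler and a reward model; MindStar's actual two search variants differ in their details, and the answer marker here is an assumption.

```python
# Hedged sketch of inference-time step search in the MindStar spirit.
# `propose_steps` and `score` are hypothetical placeholders, not the
# paper's implementation; the "final answer" marker is an assumption.
def propose_steps(path: list[str], k: int) -> list[str]:
    """Hypothetical LLM call: sample k candidate next reasoning steps."""
    raise NotImplementedError


def score(path: list[str]) -> float:
    """Hypothetical reward model: rate a partial reasoning path."""
    raise NotImplementedError


def step_search(problem: str, beam: int = 3, depth: int = 8) -> list[str]:
    paths = [[problem]]
    for _ in range(depth):
        # Expand every kept path with several candidate next steps.
        candidates = [p + [s] for p in paths for s in propose_steps(p, beam)]
        # Keep only the highest-scoring partial paths (the beam).
        paths = sorted(candidates, key=score, reverse=True)[:beam]
        if any("final answer" in p[-1].lower() for p in paths):
            break
    return max(paths, key=score)
```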
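And for the MathSensei and multi-tool entries, here is a generic sketch of the tool-ensemble pattern: route each planned sub-step of a problem to the tool suited for it. Every function body below is a hypothetical placeholder; the papers' actual planners, prompts, and tool APIs differ.

```python
# Generic sketch of a multi-tool dispatch pipeline; all tool bodies are
# hypothetical placeholders, not the papers' actual APIs.
from typing import Callable


def web_search(query: str) -> str:
    """Placeholder knowledge retriever (e.g., a web search API)."""
    raise NotImplementedError


def python_exec(program: str) -> str:
    """Placeholder program generator + executor."""
    raise NotImplementedError


def symbolic_solve(expression: str) -> str:
    """Placeholder symbolic solver (e.g., a computer-algebra API)."""
    raise NotImplementedError


TOOLS: dict[str, Callable[[str], str]] = {
    "retrieve": web_search,
    "compute": python_exec,
    "solve": symbolic_solve,
}


def run_plan(plan: list[tuple[str, str]]) -> list[str]:
    """Execute a planned sequence of (tool, input) steps.

    In the papers the plan itself comes from an LLM; it is passed in
    explicitly here to keep the sketch self-contained.
    """
    return [TOOLS[tool](tool_input) for tool, tool_input in plan]
```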