ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
- URL: http://arxiv.org/abs/2309.17452v4
- Date: Wed, 21 Feb 2024 12:59:22 GMT
- Title: ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving
- Authors: Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, Weizhu Chen
- Abstract summary: ToRA is a series of Tool-integrated Reasoning Agents designed to solve challenging mathematical problems.
ToRA models significantly outperform open-source models on 10 mathematical reasoning datasets across all scales.
ToRA-Code-34B is the first open-source model that achieves an accuracy exceeding 50% on MATH.
- Score: 170.7899683843177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have made significant progress in various language
tasks, yet they still struggle with complex mathematics. In this paper, we
propose ToRA, a series of Tool-integrated Reasoning Agents designed to solve
challenging mathematical problems by seamlessly integrating natural language
reasoning with the utilization of external tools (e.g., computation libraries
and symbolic solvers), thereby amalgamating the analytical prowess of language
and the computational efficiency of tools. To train ToRA, we curate interactive
tool-use trajectories on mathematical datasets, apply imitation learning on the
annotations, and propose output space shaping to further refine models'
reasoning behavior. As a result, ToRA models significantly outperform
open-source models on 10 mathematical reasoning datasets across all scales with
13%-19% absolute improvements on average. Notably, ToRA-7B reaches 44.6% on the
competition-level dataset MATH, surpassing the best open-source model
WizardMath-70B by 22% absolute. ToRA-Code-34B is also the first open-source
model that achieves an accuracy exceeding 50% on MATH, which significantly
outperforms GPT-4's CoT result, and is competitive with GPT-4 solving problems
with programs. Additionally, we conduct a comprehensive analysis of the
benefits and remaining challenges of tool interaction for mathematical
reasoning, providing valuable insights for future research.
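The interleaving of natural language reasoning with tool calls described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<program>` and `<output>` delimiters, the function names `run_program`, `scripted_model`, and `solve`, and the toy equation are all hypothetical, and a hard-coded function stands in for the language model. It only shows the general loop of writing a rationale, emitting a program, executing it, and feeding the result back into the context.

```python
import contextlib
import io
import re

# Hypothetical delimiters for programs; the paper's actual format may differ.
PROGRAM = re.compile(r"<program>\n(.*?)</program>", re.DOTALL)


def run_program(source: str) -> str:
    """Execute a Python snippet and return what it printed (the tool output)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, {})  # no sandboxing: acceptable only for an illustration
    return buffer.getvalue().strip()


def scripted_model(context: str) -> str:
    """Hypothetical stand-in for a language model, with hard-coded replies."""
    if "<output>" not in context:
        # First turn: a natural language rationale followed by a program.
        return (
            "I will isolate x in 2*x + 3 = 11 with a short program.\n"
            "<program>\n"
            "x = (11 - 3) / 2\n"
            "print(x)\n"
            "</program>"
        )
    # Later turn: reason over the tool output and state the answer.
    return "The program printed 4.0, so the answer is 4."


def solve(question: str, model, max_rounds: int = 3) -> str:
    """Alternate model steps and program execution until no program is emitted."""
    context = question
    for _ in range(max_rounds):
        step = model(context)
        context += "\n" + step
        match = PROGRAM.search(step)
        if match is None:
            break  # no tool call in this step: treat it as the final answer
        output = run_program(match.group(1))
        context += "\n<output>\n" + output + "\n</output>"
    return context


trajectory = solve("Solve 2*x + 3 = 11 for x.", scripted_model)
print(trajectory)
```

The returned string is one complete trajectory of rationale, program, output, and final answer. The abstract describes curating trajectories of this general kind for imitation learning, though the exact format used in the paper is not given here.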
Related papers
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks [34.09857430966818]
We introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks.
We conduct fine-tuning experiments with three prominent large language models: LLaMA-2, WizardMath, and MAmmoTH.
Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems.
arXiv Detail & Related papers (2024-04-19T08:45:42Z)
- MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning [2.9104279358536647]
We present MathSensei, a tool-augmented large language model for mathematical reasoning.
We study the complementary benefits of the tools: a knowledge retriever (Bing Web Search), a program generator and executor (Python), and a symbolic equation solver (Wolfram-Alpha API).
arXiv Detail & Related papers (2024-02-27T05:50:35Z)
- MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [39.769464414087935]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset.
Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique.
MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z)
- Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.