ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- URL: http://arxiv.org/abs/2504.11536v2
- Date: Thu, 17 Apr 2025 16:46:07 GMT
- Title: ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
- Authors: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong,
- Abstract summary: ReTool enhances long-form reasoning with tool-integrated learning. The model achieves 67% accuracy with 400 training steps. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings.
- Score: 27.07998056454784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving; these are areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and includes two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
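The rollout scheme described in the abstract, interleaving real-time code execution with generation and rewarding only the final outcome, can be illustrated with a short sketch. The snippet below is a hypothetical rendering, not ReTool's actual implementation: the `<code>`/`<interpreter>` tags, the `policy.generate_until` and `sandbox.run` interfaces, and the `extract_answer` helper are all assumptions made for illustration.

```python
# Hypothetical sketch of a tool-integrated rollout with an outcome-only reward.
# Tag names and the policy/sandbox interfaces are illustrative assumptions.

CODE_START, CODE_END = "<code>", "</code>"
INTERP_START, INTERP_END = "<interpreter>", "</interpreter>"

def rollout(policy, question, sandbox, max_turns=8):
    """Interleave natural-language reasoning with real-time code execution."""
    trajectory = question
    for _ in range(max_turns):
        # Generate until the model either finishes or closes a code block.
        segment = policy.generate_until(trajectory, stop=[CODE_END])
        trajectory += segment
        if CODE_START not in segment:
            break  # no tool call in this segment: treat it as the final answer
        code = segment.split(CODE_START, 1)[1]
        result = sandbox.run(code)  # execute the snippet in an isolated interpreter
        # Append the execution result so later reasoning can condition on it.
        trajectory += f"{CODE_END}{INTERP_START}{result}{INTERP_END}"
    return trajectory

def outcome_reward(trajectory, ground_truth, extract_answer):
    """Sparse reward on the final answer only; no priors on how tools are used."""
    return 1.0 if extract_answer(trajectory) == ground_truth else 0.0
```

Because the reward depends only on the final answer, policy-gradient updates over many such rollouts leave the model free to discover for itself when and how a code-interpreter call pays off.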
Related papers
- Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning [93.30252692375886]
We develop a series of tool-using language models trained with a rule-based reinforcement learning paradigm.
Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations.
Experiments show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results.
arXiv Detail & Related papers (2025-04-25T02:55:21Z) - OTC: Optimal Tool Calls via Reinforcement Learning [87.28134636548705]
We propose a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity (a hedged sketch of such a reward appears after this list).
Our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities.
Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities.
We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
arXiv Detail & Related papers (2025-04-16T21:45:32Z) - ToolACE-R: Tool Learning with Adaptive Self-Refinement [84.69651852838794]
Tool learning allows Large Language Models to leverage external tools for solving complex user tasks.
We propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations.
Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes.
arXiv Detail & Related papers (2025-04-02T06:38:56Z) - ToRL: Scaling Tool-Integrated RL [25.477841726836836]
ToRL is a framework for training large language models to autonomously use computational tools.
ToRL allows models to explore and discover optimal strategies for tool use.
Experiments with Qwen2.5-Math models show significant improvements.
arXiv Detail & Related papers (2025-03-30T10:16:25Z) - Learning Autonomous Code Integration for Math Language Models [30.057052324461534]
We propose a novel framework that synergizes structured exploration (E-step) with off-policy optimization (M-step) to create a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities.
Our 7B model improves by over 11% on MATH500 and 9.4% on AIME without o1-like CoT.
arXiv Detail & Related papers (2025-02-02T06:32:23Z) - iTool: Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning [39.65877861652369]
Augmenting large language models with external tools is a promising approach to enhancing their capabilities.
We show that training gains significantly decay as synthetic data increases.
We propose an iterative reinforced fine-tuning strategy designed to alleviate these challenges.
arXiv Detail & Related papers (2025-01-15T04:52:34Z) - ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark [0.0]
We introduce ToolComp, a benchmark designed to evaluate multi-step tool-use reasoning.
ToolComp is developed through a collaboration between models and human annotators.
We generate synthetic training data to compare the performance of outcome-supervised reward models with process-supervised reward models.
arXiv Detail & Related papers (2025-01-02T15:10:52Z) - Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, which addresses the limitations of conventional reward models (RMs) by empowering them with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
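As a rough illustration of the tool-integrated reward described in the OTC entry above, the sketch below combines a binary correctness term with a penalty on excess tool calls. The call budget, the weighting, and the linear decay are assumptions made for illustration; they are not the paper's actual formulation.

```python
# Illustrative tool-integrated reward: blend answer correctness with a penalty
# on the number of tool calls. Budget and weights are hypothetical.

def tool_integrated_reward(is_correct: bool, num_tool_calls: int,
                           budget: int = 4, efficiency_weight: float = 0.5) -> float:
    """Reward correct answers, discounted by how many tool calls were spent."""
    if not is_correct:
        return 0.0  # no credit for efficient but wrong trajectories
    # Efficiency term decays linearly once the call budget is exceeded.
    overshoot = max(0, num_tool_calls - budget)
    efficiency = max(0.0, 1.0 - overshoot / budget)
    return (1.0 - efficiency_weight) + efficiency_weight * efficiency

# Example: a correct answer using 6 calls against a budget of 4
# earns 0.5 + 0.5 * 0.5 = 0.75, versus 1.0 for staying within budget.
```

Under this kind of shaping, correct but tool-heavy trajectories still earn partial credit, so the policy is nudged toward fewer interpreter calls without sacrificing accuracy.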