ToolRL: Reward is All Tool Learning Needs
- URL: http://arxiv.org/abs/2504.13958v1
- Date: Wed, 16 Apr 2025 21:45:32 GMT
- Title: ToolRL: Reward is All Tool Learning Needs
- Authors: Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, Heng Ji
- Abstract summary: Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities. We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
- Score: 54.16305891389931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. However, SFT struggles to generalize to unfamiliar or complex tool use scenarios. Recent advancements in reinforcement learning (RL), particularly with R1-like models, have demonstrated promising reasoning and generalization abilities. Yet, reward design for tool use presents unique challenges: multiple tools may be invoked with diverse parameters, and coarse-grained reward signals, such as answer matching, fail to offer the fine-grained feedback required for effective learning. In this work, we present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm. We systematically explore a wide range of reward strategies, analyzing their types, scales, granularity, and temporal dynamics. Building on these insights, we propose a principled reward design tailored for tool use tasks and apply it to train LLMs using Group Relative Policy Optimization (GRPO). Empirical evaluations across diverse benchmarks demonstrate that our approach yields robust, scalable, and stable training, achieving a 17% improvement over base models and a 15% gain over SFT models. These results highlight the critical role of thoughtful reward design in enhancing the tool use capabilities and generalization performance of LLMs. All code is released to facilitate future research.
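To illustrate what "fine-grained feedback" could mean here, the sketch below decomposes a tool-call reward into format, tool-selection, and parameter components. It is a minimal, assumption-based example rather than the paper's exact formulation: the `ToolCall` structure, the weights, and the matching rules are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """A normalized tool invocation (hypothetical structure)."""
    name: str
    params: dict = field(default_factory=dict)


def tool_use_reward(predicted: list[ToolCall], gold: list[ToolCall],
                    format_ok: bool) -> float:
    """Decomposed tool-use reward: format + tool selection + parameters.

    Coarse answer matching yields only 0/1; here each correctly chosen
    tool and each correctly filled parameter earns partial credit.
    Weights and matching rules are illustrative, not from the paper.
    """
    r_format = 1.0 if format_ok else 0.0  # output parses as a tool call
    if not format_ok or not gold:
        return r_format

    gold_by_name = {g.name: g for g in gold}
    r_name = r_param = 0.0
    for call in predicted:
        match = gold_by_name.pop(call.name, None)  # no double credit
        if match is None:
            continue
        r_name += 1.0 / len(gold)  # credit for selecting the right tool
        if match.params:
            hits = sum(1 for k, v in call.params.items()
                       if match.params.get(k) == v)
            r_param += hits / (len(match.params) * len(gold))
    return r_format + r_name + r_param  # dense signal in [0, 3]
```

Under GRPO, a reward of this shape would be computed per sampled rollout and normalized within each group of rollouts to form advantages, so each component directly influences which tool-call variants get reinforced.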
Related papers
- Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
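A rough sketch of the ingredients named above: a clipped policy-gradient objective with an explicit KL penalty against a reference policy that is periodically reset to the current policy. The hyperparameter values, function shapes, and reset schedule are assumptions for illustration, not the paper's implementation.

```python
import torch


def clipped_kl_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                    logp_ref: torch.Tensor, advantages: torch.Tensor,
                    clip_eps: float = 0.2, kl_coef: float = 0.01) -> torch.Tensor:
    """PPO-style clipped surrogate plus a KL penalty to a reference policy."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(ratio * advantages, clipped).mean()
    kl = (logp_new - logp_ref).mean()  # simple per-token KL estimate
    return -(surrogate - kl_coef * kl)  # loss to minimize


def maybe_reset_reference(step: int, policy, ref_policy, every: int = 1000):
    """Periodically re-anchor the KL term to the current policy so the
    regularizer constrains drift without freezing long-run progress."""
    if step % every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```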
- VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection [47.259066449806866]
VisTA is a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. We show that VisTA achieves substantial performance gains over training-free baselines. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
arXiv Detail & Related papers (2025-05-26T17:59:17Z)
- Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [63.31585771716123]
Large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). We introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training.
arXiv Detail & Related papers (2025-05-22T09:00:19Z)
- Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [93.30252692375886]
Rule-based reinforcement learning can be used to enhance tool-calling in large language models. Tool-N1-7B/14B clearly outperform GPT-4o on several major benchmarks.
arXiv Detail & Related papers (2025-04-25T02:55:21Z)
- OTC: Optimal Tool Calls via Reinforcement Learning [87.28134636548705]
We propose a tool-integrated reward that jointly considers correctness and tool efficiency, promoting high tool productivity.
Our approach reduces tool calls by up to 73.1% and improves tool productivity by up to 229.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
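One way to read "jointly considers correctness and tool efficiency" is a reward that discounts a correct answer by the number of tool calls spent reaching it. The sketch below captures that general shape only; the exponential form and decay constant are assumptions, not OTC's actual formula.

```python
import math


def tool_productivity_reward(is_correct: bool, num_tool_calls: int,
                             decay: float = 0.15) -> float:
    """Correctness discounted by tool-call count (assumed functional form).

    A correct answer with zero tool calls scores 1.0; every extra call
    shrinks the reward, so the policy learns to invoke tools only when
    they actually improve the answer.
    """
    return math.exp(-decay * num_tool_calls) if is_correct else 0.0
```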
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [27.07998056454784]
ReTool enhances long-form reasoning with tool-integrated learning. The model achieves 67% accuracy with 400 training steps. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings.
arXiv Detail & Related papers (2025-04-15T18:10:22Z)
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs).
We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs).
We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z)
- FamilyTool: A Multi-hop Personalized Tool Use Benchmark [94.1158032740113]
We introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG).
FamilyTool challenges Large Language Models with queries spanning 1 to 3 relational hops.
Experiments reveal significant performance gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-04-09T10:42:36Z)
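To make "relational hops" concrete: a query like "which city does my sister's husband live in?" requires resolving each relation in turn before any tool can be called with the right entity. The toy knowledge graph below is invented for illustration and is not taken from the benchmark itself.

```python
# Toy family knowledge graph; entities and relations are hypothetical,
# not drawn from the FamilyTool benchmark.
KG = {
    ("alice", "sister"): "carol",
    ("carol", "husband"): "dave",
    ("dave", "lives_in"): "Chicago",
}


def resolve(start: str, hops: list[str]) -> str:
    """Follow a chain of relations; every intermediate hop must resolve
    correctly before the final value can feed a downstream tool call."""
    entity = start
    for relation in hops:
        entity = KG[(entity, relation)]
    return entity


# "Which city does my sister's husband live in?" is a 3-hop query.
assert resolve("alice", ["sister", "husband", "lives_in"]) == "Chicago"
```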
- ToolACE-R: Tool Learning with Adaptive Self-Refinement [84.69651852838794]
Tool learning allows Large Language Models to leverage external tools for solving complex user tasks. We propose ToolACE-R, a novel method that introduces adaptive self-refinement for tool invocations. Our results demonstrate the effectiveness of the proposed method, which is compatible with base models of various sizes.
arXiv Detail & Related papers (2025-04-02T06:38:56Z)
- GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation [37.85029997364506]
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools. We present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization and Weak-to-Strong Generalization.
arXiv Detail & Related papers (2025-02-26T09:54:33Z)
- LLM With Tools: A Survey [0.0]
This paper delves into the methodology, challenges, and developments in the realm of teaching LLMs to use external tools.
We introduce a standardized paradigm for tool integration guided by a series of functions that map user instructions to actionable plans.
Our exploration reveals the various challenges encountered, such as tool invocation timing, selection accuracy, and the need for robust reasoning processes.
arXiv Detail & Related papers (2024-09-24T14:08:11Z)
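The "series of functions that map user instructions to actionable plans" can be pictured as a small typed pipeline. The stage names and signatures below are invented to illustrate that framing, not the survey's own notation.

```python
from typing import Any


def plan(instruction: str) -> list[dict]:
    """Map a user instruction to an ordered list of tool-call specs.
    (Stub: a real system would prompt an LLM to produce the plan.)"""
    return [{"tool": "search", "args": {"query": instruction}}]


def execute(step: dict, tools: dict[str, Any]) -> Any:
    """Dispatch one plan step to the registered tool implementation."""
    return tools[step["tool"]](**step["args"])


tools = {"search": lambda query: f"results for {query!r}"}
for step in plan("recent papers on RL for tool use"):
    print(execute(step, tools))
```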
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) only reach a tool-use correctness rate in the range of 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms for successful tool use behaviors in biological systems: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
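A schematic of how those three mechanisms might interact: imagined queries drive trials against a tool, and the outcomes accumulate in a memory for later use. The loop structure and names below sketch the idea only; they are not the paper's code.

```python
def ste_exploration(tool, imagine_query, attempt_call, num_trials=10):
    """Schematic simulated-trial-and-error loop: imagination proposes
    use cases, trials invoke the tool, and memory accumulates episodes
    for later distillation into training or in-context examples."""
    memory = []  # long-term store of (query, call, outcome) episodes
    for _ in range(num_trials):
        query = imagine_query(tool, memory)        # imagination
        call, outcome = attempt_call(tool, query)  # trial and error
        memory.append({"query": query, "call": call, "outcome": outcome})
    return memory
```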
- Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [26.28459880766842]
We propose a decision-aware and generalizable tool-usage framework (DEER).
Specifically, we first construct tool-usage samples with multiple decision branches via an automatic generation pipeline.
Our proposed DEER is effective and significantly outperforms baselines across various datasets.
arXiv Detail & Related papers (2024-02-26T16:11:03Z)
- Efficient Tool Use with Chain-of-Abstraction Reasoning [63.08202389132155]
Large language models (LLMs) need to ground their reasoning in real-world knowledge. Challenges remain in fine-tuning LLM agents to invoke tools in multi-step reasoning problems. We propose a new method for LLMs to better leverage tools in multi-step reasoning.
arXiv Detail & Related papers (2024-01-30T21:53:30Z)
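For context, chain-of-abstraction reasoning has the model emit a chain with abstract placeholders that tools later fill in. The template format, parsing regex, and calculator tool below are illustrative assumptions, not the paper's implementation.

```python
import re

# The model emits a reasoning chain with abstract placeholders; tools
# then resolve them. This template format is a hypothetical example.
chain = "20 + 35 = [y1]; [y1] * 2 = [y2]; so the answer is [y2]."


def fill_placeholders(chain: str, solve) -> str:
    """Resolve each 'expr = [yN]' step with a tool, substituting earlier
    results into later expressions, then rewrite the full chain."""
    values: dict[str, str] = {}
    for expr, name in re.findall(r"([^;=]+)=\s*\[(y\d+)\]", chain):
        for k, v in values.items():
            expr = expr.replace(f"[{k}]", v)
        values[name] = str(solve(expr))
    for k, v in values.items():
        chain = chain.replace(f"[{k}]", v)
    return chain


# A calculator stands in for the tool; eval is for this toy input only.
print(fill_placeholders(chain, lambda expr: eval(expr)))
# -> 20 + 35 = 55; 55 * 2 = 110; so the answer is 110.
```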