Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
- URL: http://arxiv.org/abs/2510.07038v1
- Date: Wed, 08 Oct 2025 14:04:27 GMT
- Title: Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
- Authors: Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen,
- Abstract summary: Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers.<n>These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning.<n>We propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that integrates multi-hop reasoning with adaptive tool-calling capabilities.
- Score: 29.280386584974455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.
Related papers
- AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce textbfAdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior.<n>AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z) - AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent [80.83250816918861]
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought.<n>However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations.<n>We present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision.
arXiv Detail & Related papers (2025-12-23T19:57:49Z) - Toward Effective Tool-Integrated Reasoning via Self-Evolved Preference Learning [68.89572566071575]
Tool-Integrated Reasoning (TIR) enables large language models (LLMs) to improve their internal reasoning ability by integrating external tools.<n>We propose Tool-Light, a framework designed to encourage LLMs to perform TIR efficiently and accurately.<n> Experimental results on 10 datasets demonstrate the effectiveness of Tool-Light.
arXiv Detail & Related papers (2025-09-27T12:53:37Z) - THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning [25.605096023894834]
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning.<n>Despite recent advances, existing methods struggle with three key challenges.<n>We propose THOR (Tool-Integrated Hierarchical Optimization via RL) to overcome these limitations.<n>Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models.
arXiv Detail & Related papers (2025-09-17T07:16:12Z) - Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments [70.42705564227548]
We propose an automated environment construction pipeline for large language models (LLMs)<n>This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools.<n>We also introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution.
arXiv Detail & Related papers (2025-08-12T09:45:19Z) - Reinforcement learning fine-tuning of language model for instruction following and math reasoning [0.0]
This study investigates the effectiveness of reinforcement learning techniques on a compact language model (Qwen2.5-0.5B Base)<n>We compare supervised fine-tuning (SFT), Direct Preference Optimization (DPO) using preference-labeled data, and Reinforce Leave-One-Out (RLOO) with reward models.<n>Experiments show that RLOO with DeBERTa reward modeling achieves the best alignment, while DPO provides strong and consistent results.
arXiv Detail & Related papers (2025-06-11T22:49:42Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks.<n>Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use.<n>We propose a framework that encourages models to produce accurate answers with minimal tool calls.<n>Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z) - ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [27.07998056454784]
ReTool enhances long-form reasoning with tool-integrated learning.<n>Model achieves 67% accuracy with 400 training steps.<n>Remarkably, ReTool-32B attains 72.5% accuracy in extended settings.
arXiv Detail & Related papers (2025-04-15T18:10:22Z) - ToolACE-R: Model-aware Iterative Training and Adaptive Refinement for Tool Learning [84.69651852838794]
Tool learning allows Large Language Models (LLMs) to leverage external tools for solving complex user tasks.<n>We propose ToolACE-R, a novel framework that includes both model-aware iterative training and adaptive refinement for tool learning.<n>We conduct extensive experiments across several benchmark datasets, showing that ToolACE-R achieves competitive performance compared to advanced API-based models.
arXiv Detail & Related papers (2025-04-02T06:38:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.