Fugu-MT Paper Translation (Abstract): Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

Paper overview: Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.07038v1
Date: Wed, 08 Oct 2025 14:04:27 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.538168
Title: Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning
Title (reference translation): Tool-Augmented Policy Optimization: Combining Reasoning and Adaptive Tool Use through Reinforcement Learning
Authors: Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen
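Abstract summary: Recent advances in large language models (LLMs) have popularized test-time scaling, in which models generate additional reasoning tokens before producing a final answer. These approaches have shown significant performance gains on benchmarks involving mathematical reasoning. This paper proposes Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that integrates multi-hop reasoning with adaptive tool-calling capabilities.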
Reference score (attention measure computed by this site): 29.280386584974455
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.
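Abstract (reference translation): Recent advances in large language models (LLMs) have popularized test-time scaling, in which a model generates additional reasoning tokens before producing its final answer. These methods have shown notable performance gains on benchmarks involving mathematical reasoning. However, language models that rely on direct inference alone still struggle with tasks that require up-to-date knowledge or computational tools, such as calculators and code interpreters for complex arithmetic. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a new reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool calling. The approach adopts a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, adapted specifically to tool invocation scenarios so that the model can dynamically interleave complex reasoning with on-demand tool use (including search APIs and Python interpreters). To support this research, we introduce two new datasets, TAPO-easy-60K and TAPO-hard-18K, designed to train and evaluate both fact-based reasoning and mathematical calculation. Experiments on the Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of the approach: among methods with comparable parameter counts, both models achieve state-of-the-art performance on tasks requiring external knowledge and mathematical computation. Notably, TAPO uses tools more efficiently than baseline methods while preventing the excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool use to improve model performance on knowledge-intensive and computationally demanding tasks.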
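
The abstract says TAPO builds on a modified version of DAPO but does not say which components are changed. As a rough illustration only, the sketch below shows two ideas commonly associated with DAPO-style training: dropping prompt groups whose rollouts all received the same reward (dynamic sampling), and normalizing rewards within each remaining group to obtain advantages. The function names, the example rewards, and the use of the population standard deviation are assumptions made for this sketch and are not taken from the paper.

```python
from statistics import mean, pstdev


def dynamic_sample(groups):
    """Keep only prompts whose rollouts do not all share one reward.

    A group with identical rewards yields zero advantages everywhere,
    so it contributes no learning signal (hypothetical helper).
    """
    return {prompt: rs for prompt, rs in groups.items() if max(rs) != min(rs)}


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for three prompts, four rollouts each.
groups = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # all correct: filtered out
    "q2": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: kept
    "q3": [0.0, 0.0, 0.0, 0.0],  # all wrong: filtered out
}

kept = dynamic_sample(groups)
advantages = {prompt: group_advantages(rs) for prompt, rs in kept.items()}
```

In a tool-use setting, each reward would presumably score a full rollout that includes the tool calls, but how TAPO shapes that reward (for example, to discourage excessive calls) is not described in the abstract.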

List of related papers