Fugu-MT Paper Translation (Abstract): PORTool: Tool-Use LLM Training with Rewarded Tree

Paper overview: PORTool: Tool-Use LLM Training with Rewarded Tree

arxiv url: http://arxiv.org/abs/2510.26020v1
Date: Wed, 29 Oct 2025 23:28:53 GMT
Status: Translation complete
Last updated in system: 2025-10-31 16:05:09.602343
Title: PORTool: Tool-Use LLM Training with Rewarded Tree
Title (reference translation): PORTool: Tool-Use LLM Training with a Rewarded Tree
Authors: Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao
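Abstract summary: This paper proposes a reinforcement learning method that encourages a tool-use LLM to explore various trajectories that reach the correct answer. A step shared across different trajectories receives the same reward, while different steps under the same fork receive different rewards. The experiments use 17 tools to address user queries, covering both time-sensitive and time-invariant topics.
Reference score (attention level computed independently by the site): 11.154654446183455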
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current tool-use large language models (LLMs) are trained on static datasets, enabling them to interact with external tools and perform multi-step, tool-integrated reasoning, which produces tool-call trajectories. However, these models imitate how a query is resolved in a generic tool-call routine, thereby failing to explore possible solutions and demonstrating limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories yielding the correct answer. Specifically, this method starts with generating multiple rollouts for a given query, and some of them share the first few tool-call steps, thereby forming a tree-like structure. Next, we assign rewards to each step, based on its ability to produce a correct answer and make successful tool calls. A shared step across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to calculate fork-relative advantages, blended with trajectory-relative advantages, to train the LLM for tool use. The experiments utilize 17 tools to address user queries, covering both time-sensitive and time-invariant topics. We conduct ablation studies to systematically justify the necessity and the design robustness of step-wise rewards. Furthermore, we compare the proposed PORTool with other training approaches and demonstrate significant improvements in final accuracy and the number of tool-call steps.
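Abstract (reference translation): Current tool-use large language models (LLMs) are trained on static datasets and can interact with external tools to perform multi-step, tool-integrated reasoning that produces tool-call trajectories. However, because these models imitate how a query is resolved in a generic tool-call routine, they fail to explore possible solutions and show limited performance in evolving, dynamic tool-call environments. We therefore propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore various trajectories that reach the correct answer. Specifically, the method begins by generating multiple rollouts for a given query; some of them share the first few tool-call steps, forming a tree-like structure. Next, each step is assigned a reward based on its ability to produce a correct answer and make successful tool calls. A step shared across trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to compute fork-relative advantages, which are blended with trajectory-relative advantages to train the LLM for tool use. The experiments use 17 tools to address user queries, covering both time-sensitive and time-invariant topics. Ablation studies systematically justify the necessity and design robustness of the step-wise rewards. Furthermore, comparing PORTool with other training approaches shows significant improvements in final accuracy and in the number of tool-call steps.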
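
The tree-structured reward idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract gives no formulas, so every concrete choice here is an assumption made for illustration. Specifically, a step's reward is taken to be the mean final outcome of all rollouts that pass through it, the fork-relative advantage is that reward minus the mean reward of the sibling steps under the same fork, the trajectory-relative advantage is the outcome minus the mean outcome over all rollouts, and the two are mixed with a hypothetical weight `alpha`. The tool-call success term mentioned in the abstract is omitted, and the tool names are invented.

```python
from collections import defaultdict
from statistics import mean


def step_rewards(rollouts, outcomes):
    """Assumed rule: a tree node's reward is the mean outcome of rollouts through it."""
    through = defaultdict(list)
    for steps, outcome in zip(rollouts, outcomes):
        # A node is identified by its prefix, so shared prefixes map to one node.
        for depth in range(1, len(steps) + 1):
            through[tuple(steps[:depth])].append(outcome)
    return {node: mean(values) for node, values in through.items()}


def fork_relative_advantages(rewards):
    """Assumed rule: node reward minus the mean reward of all children of its parent."""
    children = defaultdict(list)
    for node in rewards:
        children[node[:-1]].append(node)
    advantages = {}
    for siblings in children.values():
        baseline = mean(rewards[s] for s in siblings)
        for s in siblings:
            advantages[s] = rewards[s] - baseline
    return advantages


def blended_advantages(rollouts, outcomes, alpha=0.5):
    """Per-step advantage: alpha * fork-relative + (1 - alpha) * trajectory-relative."""
    fork_adv = fork_relative_advantages(step_rewards(rollouts, outcomes))
    traj_baseline = mean(outcomes)
    blended = []
    for steps, outcome in zip(rollouts, outcomes):
        traj_adv = outcome - traj_baseline
        blended.append([
            alpha * fork_adv[tuple(steps[:depth])] + (1 - alpha) * traj_adv
            for depth in range(1, len(steps) + 1)
        ])
    return blended


# Hypothetical rollouts: the first two share the "search" step, then fork.
rollouts = [["search", "calc"], ["search", "lookup"], ["calendar", "calc"]]
outcomes = [1.0, 0.0, 0.0]  # 1.0 = correct final answer, 0.0 = incorrect

for steps, advs in zip(rollouts, blended_advantages(rollouts, outcomes)):
    print(steps, [round(a, 3) for a in advs])
```

In this toy setup the shared "search" step gets one reward regardless of which rollout it is read from, while the "calc" and "lookup" steps beneath it get different rewards, which matches the property stated in the abstract. Setting `alpha` to 1.0 isolates the fork-relative term and 0.0 reduces to a purely trajectory-level signal.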

Related papers