Fugu-MT Paper Translation (Abstract): Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

Paper summary: Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2606.02132v2
Date: Tue, 02 Jun 2026 07:53:40 GMT
Status: Translation complete
System update date: 2026-06-03 18:57:50.556524
Title: Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
Title (reference translation): Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
Authors: Liuji Chen, Dianxing Tang, Xing Shi, Dingshuo Chen, Qiang Liu, Shu Wu, Liang Wang
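Abstract summary: Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. This paper proposes EAPO, an Efficient Agentic Policy Optimization framework. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69% on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively.
Reference score (independently computed attention score): 26.34952204312613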
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
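Abstract (reference translation): Agentic reinforcement learning can induce tool abuse, in which models overuse external tools even on queries that internal reasoning could solve. Existing approaches mitigate this with uniform tool-use penalties or hard limits, which lower tool frequency but may also suppress useful tool-assisted exploration. This paper proposes EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy-efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results suggest that agents can learn when not to use tools without compromising tool-integrated reasoning.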
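
Illustrative sketch (not from the paper): the abstract names difficulty-aware reward shaping inside a rollout group but gives no formulas, so the Python below is only one plausible reading under stated assumptions. The function names, the `penalty` coefficient, and the use of group accuracy as a proxy for query ease are all hypothetical. The only standard part is the GRPO-style step of normalizing rewards by the group mean and standard deviation. The trajectory with zero tool calls stands in for the tool-free trajectory that EAPO adds to each group.

```python
from statistics import mean, pstdev


def shaped_rewards(correct, tool_calls, penalty=0.1):
    """Hypothetical difficulty-aware shaping for one rollout group.

    correct    -- list of bools, one per trajectory
    tool_calls -- list of ints, tool calls made by each trajectory
    penalty    -- assumed per-call penalty coefficient (not from the paper)
    """
    # Assumption: the fraction of correct trajectories approximates how
    # easy the query is, so easier queries get a larger tool-call penalty.
    ease = mean(1.0 if c else 0.0 for c in correct)
    return [
        (1.0 if c else 0.0) - penalty * ease * n
        for c, n in zip(correct, tool_calls)
    ]


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center on the group mean, scale by group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four trajectories; the first uses no tools.
rewards = shaped_rewards([True, True, False, True], [0, 2, 3, 1])
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a correct tool-free trajectory receives the largest advantage in its group, and a query that no trajectory solves carries no tool penalty at all, which loosely mirrors the stated goal of penalizing redundant calls mainly on easier queries.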

Related papers