Fugu-MT Paper Translation (Abstract): Agentic Entropy-Balanced Policy Optimization

Paper Overview: Agentic Entropy-Balanced Policy Optimization

arxiv url: http://arxiv.org/abs/2510.14545v1
Date: Thu, 16 Oct 2025 10:40:52 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.821676
Title: Agentic Entropy-Balanced Policy Optimization
Title (reference translation): Agentic Entropy-Balanced Policy Optimization
Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
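Abstract summary: Agentic reinforcement learning (Agentic RL) has contributed significantly to incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. Agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, but excessive reliance on entropy signals can impose further constraints. This paper proposes Agentic Entropy-Balanced Policy Optimization (AEPO).
Reference score (independently computed attention score): 114.90524574220764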
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1, and 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
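The rollout mechanism in component (1) can be illustrated with a small sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the function names `token_entropy`, `split_budget`, and `branch_probability`, the logistic split rule, and the linear penalty are hypothetical and are not taken from the paper.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_budget(total_budget, root_entropy, tool_entropy):
    """Split a rollout budget into (global, branch) samples.

    Hypothetical rule: the more uncertain the tool-call steps look
    compared with the initial question, the larger the branch share.
    The logistic squashing is an assumption, not the paper's formula.
    """
    branch_share = 1.0 / (1.0 + math.exp(-(tool_entropy - root_entropy)))
    branch = round(total_budget * branch_share)
    # Always keep at least one global (from-scratch) sample.
    branch = min(max(branch, 0), total_budget - 1)
    return total_budget - branch, branch


def branch_probability(entropy_gain, consecutive_branches, penalty=0.2):
    """Hypothetical branching probability at a tool-call step.

    Each consecutive high-entropy step that already branched lowers
    the probability, which discourages over-branching on one path.
    """
    base = min(max(0.5 + entropy_gain, 0.0), 1.0)
    return max(base - penalty * consecutive_branches, 0.0)
```

Under these assumptions, a budget of 16 is split evenly when the two pre-monitored entropies match and shifts toward branch sampling as tool-call entropy rises, while the penalty drives the branching probability to zero after enough consecutive branches.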
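Abstract (reference translation): In recent years, agentic reinforcement learning (Agentic RL) has contributed significantly to improving the multi-turn, long-horizon tool-use capabilities of web agents. Mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, but excessive reliance on entropy signals imposes further constraints and leads to training collapse. This paper examines the challenges caused by entropy and proposes Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy update phases. AEPO consists of (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, and incorporates entropy-aware advantage estimation to prioritize learning on high-entropy tokens. Results on 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO reaches 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker for Pass@1, and 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker for Pass@5. Further analysis shows that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.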
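The idea of entropy-aware advantage estimation in component (2) can be sketched as follows. The abstract does not state the estimator, so the scaling rule, the `alpha` parameter, and the function name `entropy_aware_advantages` are assumptions for illustration only. The stop-gradient clipping term is not shown, since it needs an autodiff framework.

```python
def entropy_aware_advantages(advantages, entropies, alpha=0.5):
    """Scale each token's advantage by 1 + alpha * normalised entropy.

    Hypothetical form: tokens with higher policy entropy receive a
    larger learning signal. `alpha` controls the strength of the
    boost and is an assumed hyperparameter, not one from the paper.
    """
    max_h = max(entropies)
    if max_h <= 0.0:
        # No uncertainty signal at all: leave the advantages unchanged.
        return list(advantages)
    return [
        a * (1.0 + alpha * h / max_h)
        for a, h in zip(advantages, entropies)
    ]


# Three tokens with entropies 0.0, 2.0 and 1.0 nats.
scaled = entropy_aware_advantages([1.0, 1.0, -2.0], [0.0, 2.0, 1.0])
print(scaled)  # [1.0, 1.5, -2.5]
```

In this sketch the sign of every advantage is preserved and only its magnitude grows with entropy, so high-uncertainty tokens are prioritized without reversing the direction of the update.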

Related papers