Fugu-MT Paper Translation (Abstract): AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Paper overview: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.00425v1
Date: Fri, 01 May 2026 05:54:37 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.863641
Title: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Title (reference translation): AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Authors: Haotian Zhao, Yuxin Zhang, Songlin Zhou, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu
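Abstract summary: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective training remains challenging because outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off.
Reference score (independently computed attention score): 13.755500788361815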
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, we derive a practical proxy to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
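Abstract (reference translation): Reinforcement learning (RL) has greatly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Effective training nevertheless remains difficult, because outcome-only rewards make it hard to assign credit to the individual steps of an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this raises supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper proposes AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration-exploitation trade-off. On the theory side, the entropy analysis is lifted from the token level to the response level to reduce token sampling variance, and entropy drift under natural gradients is shown to be intrinsically governed by the product of the advantage and the relative response surprisal. Specifically, a practical proxy is derived to reshape training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4 percent gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.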
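
The abstract's claim that entropy drift is governed by the product of the advantage and the relative response surprisal can be checked numerically in a toy setting. The following is a minimal sketch, assuming a tabular softmax policy over a small finite set of whole responses, where a natural-gradient step takes the form p'(y) ∝ p(y)·exp(lr·A(y)). It is not the paper's derivation or its practical proxy; the function names, the three-response distribution, and the advantage vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution over responses."""
    return float(-np.sum(p * np.log(p)))


def natural_gradient_step(p, advantages, lr):
    """Toy natural-gradient update for a tabular softmax policy.

    Assumption: the update is p'(y) proportional to p(y) * exp(lr * A(y)).
    """
    logits = np.log(p) + lr * advantages
    logits = logits - logits.max()  # shift for numerical stability
    q = np.exp(logits)
    return q / q.sum()


def predicted_entropy_drift(p, advantages, lr):
    """First-order entropy change: lr * E_p[(A - E[A]) * (surprisal - H)]."""
    surprisal = -np.log(p)                       # response-level surprisal
    relative_surprisal = surprisal - entropy(p)  # surprisal minus its mean
    centered_adv = advantages - np.sum(p * advantages)
    return float(lr * np.sum(p * centered_adv * relative_surprisal))


# Hypothetical policy over three candidate responses.
p = np.array([0.7, 0.2, 0.1])
lr = 1e-3

cases = {
    "reward the rare response": np.array([0.0, 0.0, 1.0]),
    "reward the likely response": np.array([1.0, 0.0, 0.0]),
}
for name, adv in cases.items():
    actual = entropy(natural_gradient_step(p, adv, lr)) - entropy(p)
    predicted = predicted_entropy_drift(p, adv, lr)
    print(f"{name}: actual={actual:.3e}, predicted={predicted:.3e}")
```

In this toy setting, a positive advantage on a high-surprisal (rare) response pushes entropy up, while a positive advantage on a low-surprisal (likely) response pushes it down, which is the kind of signal an entropy-modulating credit assignment scheme could exploit.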

Related papers