Fugu-MT Paper Translation (Abstract): AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Paper abstract: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.00425v3
Date: Fri, 08 May 2026 06:22:47 GMT
Status: Translation complete
Last updated in system: 2026-05-11 16:31:22.521638
Title: AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
Authors: Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu
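Abstract summary: Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off.
Reference score (independently computed attention score): 13.755500788361815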
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.
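Illustrative sketch (not the authors' implementation): the abstract refers to a response-level surprisal and to its interaction with the sampled-response advantage, but gives no formulas. The Python below shows one plausible way to compute those quantities for a group of sampled responses. The function names, the use of the group mean as an entropy estimate, and the sign interpretation (a first-order result commonly stated for softmax policies) are all assumptions made for illustration. The actual AEM uncertainty proxy and advantage rescaling rule are not reproduced here.

```python
from statistics import fmean


def response_surprisal(token_logprobs):
    """Response-level surprisal: -log pi(y) = -sum_t log pi(y_t | y_<t).

    Hypothetical helper: treats the whole response as one action by summing
    token log-probabilities instead of looking at tokens individually.
    """
    return -sum(token_logprobs)


def entropy_drift_signal(group_token_logprobs, advantages):
    """Mean of advantage * relative surprisal over a group of responses.

    Assumption: relative surprisal = surprisal minus the group-mean surprisal,
    where the group mean serves as a Monte Carlo estimate of response-level
    entropy. Under that assumption, a negative value suggests an update that
    tends to shrink entropy (likely responses are rewarded), and a positive
    value suggests one that tends to grow it.
    """
    surprisals = [response_surprisal(lp) for lp in group_token_logprobs]
    entropy_estimate = fmean(surprisals)
    relative = [s - entropy_estimate for s in surprisals]
    return fmean([a * r for a, r in zip(advantages, relative)])


# Two sampled responses: one likely (low surprisal), one unlikely.
group = [[-0.1, -0.2], [-1.0, -2.0]]

# Rewarding the likely response and penalizing the unlikely one.
exploit_signal = entropy_drift_signal(group, [1.0, -1.0])

# Rewarding the unlikely response instead.
explore_signal = entropy_drift_signal(group, [-1.0, 1.0])
```

In this toy group, `exploit_signal` is negative and `explore_signal` is positive, which matches the intuition that reinforcing already-likely responses sharpens the policy while reinforcing surprising ones flattens it.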
