Fugu-MT paper translation (abstract): Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

Paper overview: Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning

arxiv url: http://arxiv.org/abs/2509.22601v2
Date: Thu, 09 Oct 2025 04:27:07 GMT
Status: Translation complete
Last updated in system: 2025-10-10 15:34:28.711554
Title: Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
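Title (reference translation): Learn the Ropes, Then Trust the Wins: Self-Imitation with Progressive Exploration for Agentic Reinforcement Learning
Authors: Yulei Qin, Xiaoyu Tan, Zhengbao He, Gang Li, Haojia Lin, Zongyi Li, Zihan Xu, Yuchen Shi, Siqi Cai, Renting Rui, Shaofei Cai, Yuzheng Cai, Xuan Zhang, Sheng Ye, Ke Li, Xing Sun
Abstract summary: We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. Specifically, our approach incorporates a curriculum that uses intrinsic rewards to foster skill-level exploration. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift.
Reference score (independently computed attention score): 41.90621652673528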
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) is the dominant paradigm for sharpening the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, yet it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization is prone to RL training instability due to multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance under the guidance of the agent's own experiences, without succumbing to either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, where a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the policy evolution within a well-balanced range of entropy across stages. Specifically, our approach incorporates a curriculum to manage the exploration process, utilizing intrinsic rewards to foster skill-level exploration and facilitating action-level exploration through SIL. At first, the auxiliary tool-call reward plays a critical role in the accumulation of tool-use skills, enabling broad exposure to the unfamiliar distributions of the environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations such as the clipping of tokens with high covariance between probability and advantage are introduced to the trajectory-level entropy control to curb over-confidence.
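Abstract (reference translation): Reinforcement learning (RL) is the main paradigm for improving the strategic tool-use capabilities of LLMs on long-horizon, sparsely rewarded agent tasks, but it faces the fundamental challenge of the exploration-exploitation trade-off. Existing studies stimulate exploration through the lens of policy entropy, but such mechanical entropy maximization tends to make RL training unstable because of multi-turn distribution shift. In this paper, we target a progressive exploration-exploitation balance guided by the agent's own experiences, without falling into either entropy collapse or runaway divergence. We propose SPEAR, a curriculum-based self-imitation learning (SIL) recipe for training agentic LLMs. It extends the vanilla SIL framework, in which a replay buffer stores self-generated promising trajectories for off-policy updates, by gradually steering the evolution of the policy within a well-balanced entropy range across stages. Specifically, our approach incorporates a curriculum that uses intrinsic rewards to foster skill-level exploration and facilitates action-level exploration through SIL. At first, the auxiliary tool-call reward plays an important role in the accumulation of tool-use skills, allowing broad exposure to the unfamiliar distributions of environment feedback with an upward entropy trend. As training progresses, self-imitation is strengthened to exploit existing successful patterns from replayed experiences for comparative action-level exploration, accelerating solution iteration without unbounded entropy growth. To further stabilize training, we recalibrate the advantages of experiences in the replay buffer to address potential policy drift. Regularizations such as the clipping of tokens with high covariance between probability and advantage are introduced into the trajectory-level entropy control to curb over-confidence.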
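
The mechanisms named in the abstract (a curriculum over the tool-call reward and self-imitation, a replay buffer with recalibrated advantages, and covariance-based token clipping) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The linear schedule, the positive-advantage filter, the scalar baseline, the use of log-probabilities, and every name here (`curriculum_weights`, `Trajectory`, `ReplayBuffer`, `covariance_clip_mask`, `clip_fraction`) are assumptions made for illustration only; the abstract does not specify any of them.

```python
from dataclasses import dataclass, field
from statistics import mean


def curriculum_weights(step: int, total_steps: int) -> tuple[float, float]:
    """Return (tool_reward_weight, sil_weight) for a training step.

    Assumed linear schedule: the auxiliary tool-call reward fades out while
    the self-imitation weight ramps up. The real schedule may differ.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - progress, progress


@dataclass
class Trajectory:
    tokens: list[str]
    reward: float     # outcome reward observed when the trajectory was generated
    advantage: float  # advantage computed at collection time


@dataclass
class ReplayBuffer:
    capacity: int
    items: list[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # Assumed criterion for "promising": positive advantage at collection time.
        if traj.advantage > 0:
            self.items.append(traj)
            self.items = self.items[-self.capacity:]  # keep the newest entries

    def recalibrated(self, current_baseline: float) -> list[tuple[Trajectory, float]]:
        """Recompute advantages against the current baseline (assumed scalar).

        Stored trajectories that no longer beat the current policy's baseline
        are skipped, which is one simple way to account for policy drift.
        """
        kept = []
        for traj in self.items:
            adv = traj.reward - current_baseline
            if adv > 0:
                kept.append((traj, adv))
        return kept


def covariance_clip_mask(log_probs: list[float], advantages: list[float],
                         clip_fraction: float) -> list[bool]:
    """Return a keep-mask that drops the tokens with the largest per-token
    covariance term (log_prob - mean) * (advantage - mean)."""
    mu_p, mu_a = mean(log_probs), mean(advantages)
    cov = [(p - mu_p) * (a - mu_a) for p, a in zip(log_probs, advantages)]
    n_clip = int(len(cov) * clip_fraction)
    ranked = sorted(range(len(cov)), key=lambda i: cov[i], reverse=True)
    clipped = set(ranked[:n_clip])
    return [i not in clipped for i in range(len(cov))]
```

In a training loop under these assumptions, `curriculum_weights` would scale the tool-call reward and the self-imitation loss at each step, `ReplayBuffer.recalibrated` would supply the off-policy batch, and the mask from `covariance_clip_mask` would exclude the highest-covariance tokens from the policy-gradient update.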

Related papers