Fugu-MT paper translation (abstract): Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Paper overview: Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

arxiv url: http://arxiv.org/abs/2605.12070v1
Date: Tue, 12 May 2026 12:57:55 GMT
Status: translation complete
Last updated in system: 2026-05-13 21:48:56.8661
Title: Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Authors: Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Xiong Jun Wu, Likang Wu, Hongke Zhao
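Abstract summary: Asynchronous reinforcement learning improves rollout throughput for large language model agents, but it also introduces a critical failure mode for PPO-style off-policy correction. In practical pipelines with delayed updates and partial rollouts, the required training-side logits are often lost.
Reference score (attention score computed independently by this site): 27.34307252485658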
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code is available at https://github.com/millioniron/ROLL.
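The two-factor decomposition described in the abstract can be illustrated with a small per-token sketch. This is not the authors' implementation: the function names, the argument names, and the threshold values `eps` and `disc_cap` are hypothetical, and the way each factor is bounded is one plausible choice rather than the paper's rule. The sketch only shows that the two factors multiply to the total importance ratio, and that when the old training-side log-probability is missing, a single threshold has to act on the combined ratio.

```python
import math


def decoupled_ratios(logp_infer_old, logp_train_old, logp_train_new):
    """Split the total importance ratio into two factors (illustrative names).

    logp_infer_old: log-prob of the sampled token from the inference engine
                    at the behavior-policy version.
    logp_train_old: log-prob from the training engine at the same
                    behavior-policy version (the "old logits" term).
    logp_train_new: log-prob from the training engine at the current policy.
    """
    # Training-inference discrepancy: same policy version, different engines.
    discrepancy = math.exp(logp_train_old - logp_infer_old)
    # Policy staleness: same engine, historical version vs. current version.
    staleness = math.exp(logp_train_new - logp_train_old)
    return discrepancy, staleness


def total_ratio(logp_infer_old, logp_train_new):
    """Ratio that is still computable when logp_train_old was never stored."""
    return math.exp(logp_train_new - logp_infer_old)


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def correction_weight(logp_infer_old, logp_train_new, logp_train_old=None,
                      eps=0.2, disc_cap=2.0):
    """Return a bounded per-token weight (assumed thresholds, for illustration).

    With old logits: each factor gets its own bound.
    Without old logits: one clip range is applied to the entangled product.
    """
    if logp_train_old is None:
        return clip(total_ratio(logp_infer_old, logp_train_new),
                    1.0 - eps, 1.0 + eps)
    discrepancy, staleness = decoupled_ratios(
        logp_infer_old, logp_train_old, logp_train_new)
    return min(discrepancy, disc_cap) * clip(staleness, 1.0 - eps, 1.0 + eps)


if __name__ == "__main__":
    d, s = decoupled_ratios(-1.20, -1.25, -1.00)
    print("discrepancy:", d, "staleness:", s, "product:", d * s)
    print("total ratio:", total_ratio(-1.20, -1.00))
    print("weight with old logits:", correction_weight(-1.20, -1.00, -1.25))
    print("weight without old logits:", correction_weight(-1.20, -1.00))
```

In this toy input the staleness factor exceeds the clip range while the discrepancy factor is slightly below one, so the decoupled and entangled paths produce different weights for the same token. That difference is the kind of threshold interaction the abstract refers to.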
