Fugu-MT Paper Translation (Abstract): Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

Paper overview: Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments

arxiv url: http://arxiv.org/abs/2603.06009v1
Date: Fri, 06 Mar 2026 08:07:08 GMT
Status: Translation complete
Last updated in system: 2026-03-09 13:17:45.302811
Title: Preventing Learning Stagnation in PPO by Scaling to 1 Million Parallel Environments
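Title (reference translation): Preventing learning stagnation in PPO by scaling to 1 million parallel environments
Authors: Michael Beukman, Khimya Khetarpal, Zeyu Zheng, Will Dabney, Jakob Foerster, Michael Dennis, Clare Lyle
Abstract summary: Plateaus in certain regimes arise because sample-based estimates of the loss become poor proxies for the true objective over the course of training. There are two ways to address this type of learning stagnation: reduce the step size or increase the number of samples collected between updates. By scaling PPO to more than 1 million parallel environments, we vastly outperform prior baselines in a complex open-ended domain.
Reference score (attention level computed independently by this site): 31.754045125599305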
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO due to its widespread adoption, we show that plateaus in certain regimes arise not because of known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss eventually become poor proxies for the true objective over the course of training. As a recap, PPO switches between sampling rollouts from several parallel environments online using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop, and conceptually model it as stochastic optimization. The step size is then controlled by the regularization strength towards the previous policy and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation: either reduce the step size or increase the number of samples collected between updates. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for how to co-scale the other hyperparameters when increasing parallelization, and show that incorrectly doing so can lead to severe performance degradation. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1M parallel environments, thereby enabling monotonic performance improvement up to one trillion transitions.
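Abstract (reference translation): Plateaus, where an agent's performance stagnates at a suboptimal level, are a common problem in deep on-policy RL. Focusing on PPO because of its widespread adoption, we show that plateaus in certain regimes arise not from known exploration, capacity, or optimization challenges, but because sample-based estimates of the loss become poor proxies for the true objective over the course of training. As a recap, PPO alternates between sampling rollouts online from several parallel environments using the current policy (which we call the outer loop) and performing repeated minibatch SGD steps against this offline dataset (the inner loop). In our work we consider only the outer loop and conceptually model it as stochastic optimization. The step size is controlled by the regularization strength towards the previous policy, and the gradient noise by the number of samples collected between policy update steps. This model predicts that performance will plateau at a suboptimal level if the outer step size is too large relative to the noise. Recasting PPO in this light makes it clear that there are two ways to address this particular type of learning stagnation. We first validate the predictions of our model and investigate how hyperparameter choices influence the step size and update noise, concluding that increasing the number of parallel environments is a simple and robust way to reduce both factors. Next, we propose a recipe for co-scaling the other hyperparameters when increasing parallelization. Finally, we vastly outperform prior baselines in a complex open-ended domain by scaling PPO to more than 1 million parallel environments.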
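The abstract's claim that performance plateaus when the outer step size is too large relative to the noise can be illustrated with a toy stochastic-optimization sketch. This is not the paper's model or PPO itself: it assumes a one-dimensional quadratic objective f(x) = 0.5 * x^2, Gaussian gradient noise whose standard deviation shrinks as 1/sqrt(num_samples), and hypothetical function names (plateau_gap, predicted_gap) chosen only for illustration. Under those assumptions the long-run suboptimality settles near step_size * noise_std^2 / (2 * num_samples * (2 - step_size)), so it falls when either the step size is reduced or the number of samples per update is increased.

```python
import random


def plateau_gap(step_size, num_samples, noise_std=1.0,
                num_updates=20000, burn_in=2000, seed=0):
    """Average suboptimality 0.5 * x^2 of noisy gradient descent on
    f(x) = 0.5 * x^2, measured after a burn-in period.

    Toy stand-in (an assumption, not the paper's setup): step_size plays
    the role of the outer step size, num_samples the number of samples
    collected between updates.
    """
    rng = random.Random(seed)
    x = 5.0
    total = 0.0
    for t in range(num_updates):
        # Averaging num_samples i.i.d. gradient samples shrinks the
        # noise standard deviation by a factor of sqrt(num_samples).
        noise = rng.gauss(0.0, noise_std / num_samples ** 0.5)
        x -= step_size * (x + noise)  # true gradient of f is x
        if t >= burn_in:
            total += 0.5 * x * x
    return total / (num_updates - burn_in)


def predicted_gap(step_size, num_samples, noise_std=1.0):
    """Stationary expected suboptimality for the toy recursion above
    (valid for 0 < step_size < 2)."""
    return 0.5 * step_size * noise_std ** 2 / (num_samples * (2.0 - step_size))


if __name__ == "__main__":
    for step_size, num_samples in [(0.5, 1), (0.05, 1), (0.5, 100)]:
        print(step_size, num_samples,
              plateau_gap(step_size, num_samples),
              predicted_gap(step_size, num_samples))
```

In this toy setting, shrinking the step size by 10x or collecting 100x more samples per update both lower the level at which the iterate stops improving, which mirrors the two remedies named in the abstract.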

Related papers

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
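Group Relative Policy Optimization (GRPO) scales LLM reasoning effectively, but its computational cost is prohibitive. We propose Dynamic Pruning Policy Optimization (DPPO), which enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
Paper reference translation (metadata) (2026-03-04T14:48:53Z)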
Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
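Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for large language models (LLMs). We propose DPPO, which replaces clipping with a more principled constraint. DPPO achieves better training and efficiency than existing methods, providing a more robust foundation for RL-based fine-tuning.
Paper reference translation (metadata) (2026-02-04T18:59:04Z)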
Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
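Online on-policy preference learning algorithms for language model alignment significantly outperform their offline counterparts. We analyze how the coverage of the sampling policy evolves over the course of on-policy training. We develop a principled on-policy scheme for reward distillation in the general function class setting.
Paper reference translation (metadata) (2026-01-13T10:46:06Z)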
Closing the Approximation Gap of Partial AUC Optimization: A Tale of Two Formulations [121.39938773554523]
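The area under the ROC curve (AUC) is a key evaluation metric in real-world scenarios with both class imbalance and decision constraints. To close the approximation gap of PAUC optimization, we propose two simple instance-wise minimax reformulations. The resulting algorithms enjoy a per-iteration computational complexity that is linear in the sample size and a convergence rate of $O(-2/3)$ for typical one-way and two-way PAUCs.
Paper reference translation (metadata) (2025-12-01T02:52:33Z)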
Quantile Reward Policy Optimization: Alignment with Pointwise Regression and Exact Partition Functions [0.5416466085090772]
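We introduce Quantile Reward Policy Optimization (QRPO). QRPO uses quantile rewards to enable regression to the closed-form solution of the KL-regularized RL objective. It consistently achieves top performance on chat and coding evaluations.
Paper reference translation (metadata) (2025-07-10T17:56:24Z)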
A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks [2.3453441553817043]
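Post-Decision Proximal Policy Optimization is a novel variation of Proximal Policy Optimization, a state-of-the-art deep reinforcement learning method. The proposed method incorporates post-decision states and dual critics to reduce the problem's dimensionality and improve the accuracy of value function estimation.
Paper reference translation (metadata) (2025-04-07T14:56:43Z)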
Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
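We propose the Observation-Aware Spectral (OAS) estimation technique, which can learn the POMDP parameters from samples collected using a belief-based policy. We show the consistency of the OAS procedure and prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
Paper reference translation (metadata) (2024-10-02T08:46:34Z)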
SAPG: Split and Aggregate Policy Gradients [37.433915947580076]
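We propose a new on-policy RL algorithm that can effectively leverage large-scale environments by splitting them into chunks and fusing them back together via importance sampling. Our algorithm, termed SAPG, shows significantly higher performance across a variety of challenging environments where vanilla PPO and other strong baselines fail to achieve high performance.
Paper reference translation (metadata) (2024-07-29T17:59:50Z)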
You May Not Need Ratio Clipping in PPO [117.03368180633463]
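The Proximal Policy Optimization (PPO) method learns a policy by iteratively performing multiple minibatch optimization epochs on one set of sampled data. Ratio-clipping PPO is a popular variant that clips the probability ratios between the target policy and the policy used to collect samples. In this paper, we show that this ratio clipping may not be a good option, as it can fail to effectively bound the ratios. We show that ESPO can easily be scaled up to distributed training with many workers while also delivering strong performance.
Paper reference translation (metadata) (2022-01-31T20:26:56Z)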

The related papers list is generated automatically from the titles and abstracts of papers on this site.
