Fugu-MT Paper Translation (Abstract): Laminar: A Scalable Asynchronous RL Post-Training Framework

Paper summary: Laminar: A Scalable Asynchronous RL Post-Training Framework

arxiv url: http://arxiv.org/abs/2510.12633v1
Date: Tue, 14 Oct 2025 15:29:14 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.372093
Title: Laminar: A Scalable Asynchronous RL Post-Training Framework
Authors: Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu
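Abstract summary: Long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current RL systems rely on global weight synchronization between the actor and the rollouts, which creates a rigid model update schedule. The authors propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture.
Reference score (independently computed attention measure): 20.127034898123508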
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) post-training for Large Language Models (LLMs) is now scaling to large clusters and running for extended durations to enhance model reasoning performance. However, the scalability of existing RL frameworks is limited, as extreme long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current asynchronous RL systems attempt to mitigate this, but they rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. This global synchronization is ill-suited for the highly skewed and evolving distribution of trajectory generation latency in RL training, crippling training efficiency. Our key insight is that efficient scaling requires breaking this lockstep through trajectory-level asynchrony, which generates and consumes each trajectory independently. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture. First, we replace global updates with a tier of relay workers acting as a distributed parameter service. This enables asynchronous and fine-grained weight synchronization, allowing rollouts to pull the latest weight anytime without stalling the actor's training loop. Second, a dynamic repack mechanism consolidates long-tail trajectories onto a few dedicated rollouts, maximizing generation throughput. The fully decoupled design also isolates failures, ensuring robustness for long-running jobs. Our evaluation on a 1024-GPU cluster shows that Laminar achieves up to 5.48$\times$ training throughput speedup over state-of-the-art systems, while reducing model convergence time.
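The underutilization claim in the abstract can be illustrated with a toy model. The sketch below is not taken from the paper: it assumes that, under global synchronization, every rollout waits for the slowest one before the next weight update, and it uses made-up latencies. The function name `barrier_utilization` and the sample values are hypothetical.

```python
def barrier_utilization(latencies):
    """Fraction of rollout time spent generating when all rollouts wait
    for the slowest one before the next global weight update.

    Toy model only: each entry is one rollout's generation time for a step.
    """
    if not latencies:
        raise ValueError("latencies must be non-empty")
    step_time = max(latencies)   # the global barrier lasts as long as the slowest rollout
    busy_time = sum(latencies)   # time actually spent generating, summed over rollouts
    return busy_time / (len(latencies) * step_time)


# Hypothetical long-tailed step: three fast rollouts and one straggler (seconds).
latencies = [1.0, 1.0, 1.0, 10.0]
print(f"utilization under a global barrier: {barrier_utilization(latencies):.3f}")
```

In this toy setting a single straggler leaves the other three rollouts idle for most of the step, which is the kind of lockstep that trajectory-level asynchrony is meant to break.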

Related papers