Fugu-MT Paper Translation (Abstract): Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Paper overview: Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

arxiv url: http://arxiv.org/abs/2601.14243v1
Date: Tue, 20 Jan 2026 18:54:31 GMT
Status: Translation complete
Last updated in system: 2026-01-21 22:47:23.459094
Title: Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow
Title (reference translation): Jet-RL: Enabling On-Policy FP8 Reinforcement Learning through a Unified Training and Rollout Precision Flow
Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu
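Abstract summary: This work presents a comprehensive study of FP8 RL training and proposes Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization.
Reference score (independently computed attention metric): 48.48936574810267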
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.
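Abstract (reference translation): Reinforcement learning (RL) is essential for improving the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, and the rollout phase accounts for more than 70% of total training time. Quantized RL training, in particular with FP8 precision, is a promising way to relieve this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while keeping BF16 precision for training. This work presents the first comprehensive study of FP8 RL training and shows that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. The analysis shows that these failures stem from the off-policy nature of the approach, which introduces a substantial numerical mismatch between training and inference. Motivated by these observations, the authors propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, which minimizes numerical discrepancies and removes the need for inefficient inter-step calibration. Experiments validate the effectiveness of Jet-RL: the method achieves up to a 33% speedup in the rollout phase, up to a 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.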
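The off-policy mismatch described in the abstract can be illustrated with a toy sketch. This is not Jet-RL's implementation: the one-layer "model", the weights, the per-tensor scaling scheme, and the helper names (`fake_fp8_e4m3`, `quantize_weights`, `policy`) are all assumptions made up for illustration, and Python's float64 stands in for the higher-precision (BF16) training path. It compares the importance ratio between the training policy and the rollout policy when only rollout uses simulated FP8 weights versus when both share the same FP8 flow.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the E4M3 ("fn") format


def fake_fp8_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), FP8_E4M3_MAX)   # saturate instead of overflowing
    _, ex = math.frexp(a)           # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, -6)             # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)           # 3 mantissa bits: 8 steps per binade
    return sign * min(round(a / step) * step, FP8_E4M3_MAX)


def quantize_weights(w):
    """Per-tensor scaled fake quantization (the scaling scheme is an assumption)."""
    scale = max(abs(v) for row in w for v in row) / FP8_E4M3_MAX
    return [[fake_fp8_e4m3(v / scale) * scale for v in row] for row in w]


def policy(w, h):
    """Softmax over logits = w @ h for a toy one-layer model."""
    logits = [sum(wi * hi for wi, hi in zip(row, h)) for row in w]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


W = [[0.3, -1.7], [1.1, 0.45], [-0.9, 2.0]]  # made-up weights: 3 tokens x 2 dims
h = [1.0, 0.5]                                # made-up hidden state
token = 1                                     # pretend this token was sampled

pi_rollout = policy(quantize_weights(W), h)    # rollout with simulated FP8 weights
pi_train_hi = policy(W, h)                     # trainer in higher precision
pi_train_fp8 = policy(quantize_weights(W), h)  # trainer sharing the FP8 flow

# Importance ratio pi_train(token) / pi_rollout(token); 1.0 means on-policy.
ratio_mixed = pi_train_hi[token] / pi_rollout[token]
ratio_unified = pi_train_fp8[token] / pi_rollout[token]
print(f"mixed precision ratio:   {ratio_mixed:.6f}")
print(f"unified precision ratio: {ratio_unified:.6f}")
```

In this sketch the mixed-precision ratio drifts away from 1 even for a single token of a tiny model, while the unified flow keeps it at exactly 1 because both sides evaluate the same quantized weights. Real systems also quantize activations and gradients, which this example leaves out.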

Related papers