Fugu-MT Paper Translation (Abstract): FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

Paper overview: FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning

arxiv url: http://arxiv.org/abs/2601.18150v1
Date: Mon, 26 Jan 2026 05:12:05 GMT
Status: Translation complete
Last updated in system: 2026-01-27 15:23:08.684197
Title: FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Title (reference translation): FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
Authors: Zhaopeng Qiu, Shuang Yu, Jingqi Zhang, Shuai Zhang, Xue Huang, Jingyi Yang, Junjie Lai
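Abstract summary: This report presents a practical FP8 rollout stack for reinforcement learning of large language models (LLMs). It (i) enables FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extends FP8 to the KV cache to remove long-context memory bottlenecks, and (iii) mitigates train-inference mismatch using importance-sampling-based rollout correction. Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
Reference score (independently computed attention score): 12.855945066222743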
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequence lengths make attention and KV-cache memory dominate end-to-end step time. FP8 offers an attractive lever for accelerating RL by reducing compute cost and memory traffic during rollout, but applying FP8 in RL introduces unique engineering and algorithmic challenges: policy weights change every step (requiring repeated quantization and weight synchronization into the inference engine) and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to KV-cache to remove long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.
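The blockwise FP8 quantization named in (i) can be illustrated with a small pure-Python simulation. This is only a sketch under stated assumptions: the abstract does not give the block shape, the scale format, or the kernels used, so the helper names (`round_to_e4m3`, `quantize_blockwise`, `dequantize_blockwise`), the 1-D contiguous blocks, and the choice of the E4M3 format (largest finite value 448, 3 mantissa bits) are illustrative rather than taken from the paper.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    _, e = math.frexp(mag)
    exp = max(e - 1, -6)        # -6 is the smallest normal exponent; below it, subnormals
    step = 2.0 ** (exp - 3)     # spacing of representable values with 3 mantissa bits
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_blockwise(values, block_size):
    """Quantize a flat list block by block; returns (quantized, scales).

    Each contiguous block gets its own scale so that the block's largest
    magnitude maps onto the FP8 maximum (illustrative layout, not the paper's).
    """
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.extend(round_to_e4m3(v / scale) for v in block)
    return quantized, scales


def dequantize_blockwise(quantized, scales, block_size):
    """Map simulated FP8 values back to real values using the per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(quantized)]
```

Because every block carries its own scale, a single large outlier only coarsens the resolution of its own block rather than of the whole tensor, which is the usual motivation for blockwise over per-tensor scaling.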
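Abstract (reference translation): Reinforcement learning (RL) for large language models (LLMs) is increasingly bottlenecked by rollout (generation), where long output sequences cause attention and KV-cache memory to dominate end-to-end step time. FP8 is an attractive lever for accelerating RL because it reduces compute cost and memory traffic during rollout, but applying FP8 to RL raises distinctive engineering and algorithmic challenges: the policy weights change at every step (requiring repeated quantization and weight synchronization into the inference engine), and low-precision rollouts can deviate from the higher-precision policy assumed by the trainer, causing train-inference mismatch and potential instability. This report presents a practical FP8 rollout stack for LLM RL, implemented in the veRL ecosystem with support for common training backends (e.g., FSDP/Megatron-LM) and inference engines (e.g., vLLM/SGLang). We (i) enable FP8 W8A8 linear-layer rollout using blockwise FP8 quantization, (ii) extend FP8 to the KV cache, removing long-context memory bottlenecks via per-step QKV scale recalibration, and (iii) mitigate mismatch using importance-sampling-based rollout correction (token-level TIS/MIS variants). Across dense and MoE models, these techniques deliver up to 44% rollout throughput gains while preserving learning behavior comparable to BF16 baselines.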
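The token-level rollout correction named in (iii) can be sketched as follows. The abstract does not expand the acronyms, so this sketch assumes that TIS means truncated importance sampling (the per-token ratio is capped) and MIS means masked importance sampling (tokens whose ratio exceeds the cap are dropped). The function names, the `clip_max` parameter and its default, and the simple mean reduction are all hypothetical and not taken from the paper or from veRL.

```python
import math


def tis_weights(train_logprobs, rollout_logprobs, clip_max=2.0):
    """Truncated variant (assumed reading of TIS).

    Per-token weight = min(pi_train / pi_rollout, clip_max), where the ratio
    is computed from log-probabilities of the sampled tokens.
    """
    return [
        min(math.exp(lp_train - lp_rollout), clip_max)
        for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs)
    ]


def mis_weights(train_logprobs, rollout_logprobs, clip_max=2.0):
    """Masked variant (assumed reading of MIS).

    Tokens whose ratio exceeds clip_max get weight 0 instead of being capped.
    """
    weights = []
    for lp_train, lp_rollout in zip(train_logprobs, rollout_logprobs):
        ratio = math.exp(lp_train - lp_rollout)
        weights.append(ratio if ratio <= clip_max else 0.0)
    return weights


def corrected_loss(per_token_losses, weights):
    """Mean of per-token losses reweighted by the importance weights."""
    total = sum(w * loss for w, loss in zip(weights, per_token_losses))
    return total / len(per_token_losses)
```

Here `train_logprobs` stands for the log-probabilities the higher-precision trainer assigns to the sampled tokens, and `rollout_logprobs` for those reported by the FP8 inference engine. When the two agree, every ratio is 1 and the loss is unchanged; the cap only limits how much a badly mismatched token can contribute.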

Related papers