Fugu-MT Paper Translation (Abstract): FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Paper summary: FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

arxiv url: http://arxiv.org/abs/2604.06916v1
Date: Wed, 08 Apr 2026 10:14:47 GMT
Status: Translation complete
Last updated in system: 2026-04-09 17:30:51.476358
Title: FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
Authors: Yitong Li, Junsong Chen, Shuchen Xue, Pengcuo Zeren, Siyuan Fu, Dinghao Yang, Yangyang Tang, Junjie Bai, Ping Luo, Song Han, Enze Xie
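Abstract summary: Reinforcement-learning-based post-training has emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. Scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. This paper proposes Sol-RL, a novel FP4-empowered two-stage reinforcement learning framework.
Reference score (independently computed attention score): 38.64059734487925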
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement-Learning-based post-training has recently emerged as a promising paradigm for aligning text-to-image diffusion models with human preferences. In recent studies, increasing the rollout group size yields pronounced performance improvements, indicating substantial room for further alignment gains. However, scaling rollouts on large-scale foundational diffusion models (e.g., FLUX.1-12B) imposes a heavy computational burden. To alleviate this bottleneck, we explore the integration of FP4 quantization into Diffusion RL rollouts. Yet, we identify that naive quantized pipelines inherently introduce risks of performance degradation. To overcome this dilemma between efficiency and training integrity, we propose Sol-RL (Speed-of-light RL), a novel FP4-empowered Two-stage Reinforcement Learning framework. First, we utilize high-throughput NVFP4 rollouts to generate a massive candidate pool and extract a highly contrastive subset. Second, we regenerate these selected samples in BF16 precision and optimize the policy exclusively on them. By decoupling candidate exploration from policy optimization, Sol-RL integrates the algorithmic mechanisms of rollout scaling with the system-level throughput gains of NVFP4. This synergistic algorithm-hardware design effectively accelerates the rollout phase while reserving high-fidelity samples for optimization. We empirically demonstrate that our framework maintains the training integrity of BF16 precision pipeline while fully exploiting the throughput gains enabled by FP4 arithmetic. Extensive experiments across SANA, FLUX.1, and SD3.5-L substantiate that our approach delivers superior alignment performance across multiple metrics while accelerating training convergence by up to $4.64\times$, unlocking the power of massive rollout scaling at a fraction of the cost.
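The two-stage loop described in the abstract (cheap low-precision exploration, then high-precision regeneration of a selected subset) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how the "highly contrastive subset" is chosen, so the sketch assumes it means the `k` lowest-reward and `k` highest-reward candidates; it assumes regeneration reuses the same seed; and it assumes GRPO-style group-normalized advantages. All function names (`select_contrastive`, `normalized_advantages`, `two_stage_step`, `cheap_rollout`, `exact_rollout`) are hypothetical, and the rollouts are toy scalar stand-ins rather than real NVFP4 or BF16 diffusion sampling.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def select_contrastive(rewards: Sequence[float], k: int) -> List[int]:
    """Return indices of the k lowest- and k highest-reward candidates.

    Assumption: "contrastive" is read here as "both extremes of the pool".
    """
    if k <= 0 or 2 * k > len(rewards):
        raise ValueError("need 0 < 2*k <= len(rewards)")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[:k] + order[-k:]


def normalized_advantages(rewards: Sequence[float]) -> List[float]:
    """Group-normalize rewards as (r - mean) / std (an assumed choice)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def two_stage_step(
    seeds: Sequence[int],
    cheap_rollout: Callable[[int], float],
    exact_rollout: Callable[[int], float],
    reward_fn: Callable[[float], float],
    k: int,
) -> Tuple[List[int], List[float]]:
    # Stage 1: score a large candidate pool with the cheap sampler.
    cheap_rewards = [reward_fn(cheap_rollout(s)) for s in seeds]
    chosen = [seeds[i] for i in select_contrastive(cheap_rewards, k)]
    # Stage 2: regenerate only the chosen seeds with the exact sampler,
    # and compute the training signal from those samples alone.
    exact_rewards = [reward_fn(exact_rollout(s)) for s in chosen]
    return chosen, normalized_advantages(exact_rewards)


# Toy stand-ins (hypothetical): a "sample" is one float per seed, and the
# cheap sampler is the exact one rounded to one decimal place.
def exact_rollout(seed: int) -> float:
    return random.Random(seed).gauss(0.0, 1.0)


def cheap_rollout(seed: int) -> float:
    return round(exact_rollout(seed), 1)


chosen, adv = two_stage_step(
    list(range(16)), cheap_rollout, exact_rollout, lambda x: x, k=2
)
print(chosen, adv)
```

In this sketch the cheap sampler only ranks candidates; the advantages come entirely from the regenerated samples, which mirrors the abstract's point that the policy is optimized exclusively on BF16 outputs.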

Related papers