Fugu-MT paper translation (abstract): Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Paper overview: Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

arxiv url: http://arxiv.org/abs/2605.23522v1
Date: Fri, 22 May 2026 11:37:22 GMT
Status: translation complete
System update date: 2026-05-25 17:29:20.330624
Title: Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
Authors: Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong
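Abstract summary: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. The stochastic sampler, which controls exploration behavior and denoising dynamics, is part of the policy. The paper proposes a new stochastic sampler that balances effective exploration with stability.
Reference score (attention score computed independently by the site): 56.67321805551389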
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.
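The ODE-to-SDE step described in the abstract can be illustrated with a minimal toy sketch. This is not the Precise sampler and not the paper's schedule: it assumes a 1-D Gaussian data distribution, the linear interpolant x_t = (1 - t) x0 + t eps, a constant noise scale g, and plain Euler / Euler-Maruyama steps. The names MU, S, mean_var, velocity, score, and sample are hypothetical and chosen only for this illustration. Because the marginals are Gaussian, the velocity and score are available in closed form, so the stochastic sampler can be compared against the deterministic one.

```python
import numpy as np

# Toy setup (assumed for illustration, not taken from the paper):
# data x0 ~ N(MU, S^2), noise eps ~ N(0, 1), x_t = (1 - t) * x0 + t * eps.
MU, S = 2.0, 0.5

def mean_var(t):
    """Mean and variance of the Gaussian marginal p_t."""
    return (1.0 - t) * MU, (1.0 - t) ** 2 * S**2 + t**2

def velocity(x, t):
    """Closed-form flow-matching velocity E[eps - x0 | x_t = x]."""
    m, var = mean_var(t)
    return -MU + (t - (1.0 - t) * S**2) / var * (x - m)

def score(x, t):
    """Score of the Gaussian marginal, d/dx log p_t(x)."""
    m, var = mean_var(t)
    return -(x - m) / var

def sample(n, steps, g=0.0, seed=0):
    """Integrate from t=1 (noise) to t=0 (data).

    g = 0 is the deterministic Euler ODE sampler. g > 0 is an
    Euler-Maruyama SDE sampler: it injects noise of scale g and adds
    the score term that keeps the continuous-time marginals unchanged.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    dt = 1.0 / steps
    for k in range(steps):
        t = 1.0 - k * dt
        drift = -velocity(x, t) + 0.5 * g**2 * score(x, t)
        x = x + drift * dt + g * np.sqrt(dt) * rng.standard_normal(n)
    return x

ode = sample(20_000, steps=200)          # deterministic trajectory
sde = sample(20_000, steps=200, g=0.5)   # stochastic policy, same target
print(ode.mean(), ode.std(), sde.mean(), sde.std())
```

With enough steps, both samplers should end near the target mean MU and standard deviation S. Reducing steps while keeping g fixed is one way to observe, in this toy setting, how a naive discretization can drift from the intended distribution, which is the kind of issue the abstract attributes to existing samplers.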

Related papers