Fugu-MT Paper Translation (Abstract): Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

arxiv url: http://arxiv.org/abs/2602.04663v1
Date: Wed, 04 Feb 2026 15:36:42 GMT
Status: Translation complete
Last updated in system: 2026-02-05 19:45:11.59441
Title: Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
Authors: Jaemoo Choi, Yuchen Zhu, Wei Guo, Petr Molodyk, Bo Yuan, Jinbin Bai, Yi Xin, Molei Tao, Yongxin Chen
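Abstract summary: This paper systematically analyzes the RL design space by disentangling three factors: policy-gradient objectives, likelihood estimators, and rollout sampling schemes. It shows that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization.
Reference score (attention score computed independently by this site): 45.80068602880684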
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning has been widely applied to diffusion and flow models for visual tasks such as text-to-image generation. However, these tasks remain challenging because diffusion models have intractable likelihoods, which creates a barrier for directly applying popular policy-gradient type methods. Existing approaches primarily focus on crafting new objectives built on already heavily engineered LLM objectives, using ad hoc estimators for likelihood, without a thorough investigation into how such estimation affects overall algorithmic performance. In this work, we provide a systematic analysis of the RL design space by disentangling three factors: i) policy-gradient objectives, ii) likelihood estimators, and iii) rollout sampling schemes. We show that adopting an evidence lower bound (ELBO) based model likelihood estimator, computed only from the final generated sample, is the dominant factor enabling effective, efficient, and stable RL optimization, outweighing the impact of the specific policy-gradient loss functional. We validate our findings across multiple reward benchmarks using SD 3.5 Medium, and observe consistent trends across all tasks. Our method improves the GenEval score from 0.24 to 0.95 in 90 GPU hours, which is $4.6\times$ more efficient than FlowGRPO and $2\times$ more efficient than the SOTA method DiffusionNFT without reward hacking.
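To make the idea of a likelihood estimator "computed only from the final generated sample" more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rectified-flow parameterization (x_t = (1 - t) x_0 + t eps, with target velocity eps - x_0) and uses the negative denoising error as an ELBO-style surrogate for log p(x_0), up to an additive constant and a timestep weighting. The names `elbo_log_likelihood`, `policy_gradient_surrogate`, and `toy_velocity` are hypothetical, and the plain advantage-weighted objective stands in for whichever policy-gradient loss is actually used.

```python
import numpy as np

def elbo_log_likelihood(velocity_fn, x0, rng, num_samples=8):
    """ELBO-style surrogate for log p(x0), using only the final sample x0.

    Assumption: rectified-flow noising, x_t = (1 - t) * x0 + t * eps.
    Returns one value per batch element (higher means more likely),
    up to an additive constant and a timestep weighting.
    """
    batch = x0.shape[0]
    total = np.zeros(batch)
    for _ in range(num_samples):
        # Draw a random timestep and noise for every sample in the batch.
        t = rng.uniform(1e-3, 1.0, size=(batch, 1))
        eps = rng.standard_normal(x0.shape)
        x_t = (1.0 - t) * x0 + t * eps
        target = eps - x0  # velocity the model should predict
        err = velocity_fn(x_t, t) - target
        total += -np.sum(err ** 2, axis=1)
    return total / num_samples

def policy_gradient_surrogate(log_lik, rewards):
    """Advantage-weighted objective: minimizing it raises the estimated
    likelihood of samples whose reward is above the batch mean."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return -np.mean(adv * log_lik)

def toy_velocity(x_t, t):
    # Hypothetical stand-in model: the exact velocity field when all
    # data sits at the origin, so x0 = 0 gets zero denoising error.
    return x_t / t

rng = np.random.default_rng(0)
samples = np.stack([np.zeros(4), np.ones(4)])  # two "final" samples
rewards = np.array([1.0, 0.0])                 # made-up rewards
log_lik = elbo_log_likelihood(toy_velocity, samples, rng)
loss = policy_gradient_surrogate(log_lik, rewards)
```

In this toy setup the sample at the origin receives a surrogate log-likelihood of approximately zero and the other sample a strictly lower one. In a real training loop the same quantities would be computed in an autodiff framework so that the loss can be differentiated with respect to the model parameters.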

Related papers