Fugu-MT Paper Translation (Summary): Diffusion Reinforcement Learning via Centered Reward Distillation

Paper summary: Diffusion Reinforcement Learning via Centered Reward Distillation

arxiv url: http://arxiv.org/abs/2603.14128v1
Date: Sat, 14 Mar 2026 21:29:33 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.620924
Title: Diffusion Reinforcement Learning via Centered Reward Distillation
Title (reference translation): Diffusion Reinforcement Learning via Centered Reward Distillation
Authors: Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton,
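Abstract summary: We propose Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization and built on forward-process-based fine-tuning. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift. Text-to-image post-training experiments with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
Reference score (attention score computed independently by the site): 35.979608265594685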
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, yet many practically important behaviors such as fine-grained prompt fidelity, compositional correctness, and text rendering are weakly specified by score or flow matching pretraining objectives. Reinforcement Learning (RL) fine-tuning with external, black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates; forward-process approaches converge faster but can suffer from distribution drift, and hence reward hacking. In this work, we present Centered Reward Distillation (CRD), a diffusion RL framework derived from KL-regularized reward maximization built on forward-process-based fine-tuning. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. To enable reliable text-to-image fine-tuning, we introduce techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) KL anchoring to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pre-trained model, and (iii) reward-adaptive KL strength to accelerate early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Experiments on text-to-image post-training with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization results with fast convergence and reduced reward hacking, as validated on unseen preference metrics.
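Abstract (reference translation): Diffusion and flow models achieve state-of-the-art (SOTA) generative performance, but many practically important behaviors, such as fine-grained prompt fidelity, compositional correctness, and text rendering, are only weakly specified by score or flow matching pretraining objectives. Reinforcement learning (RL) fine-tuning with external black-box rewards is a natural remedy, but diffusion RL is often brittle. Trajectory-based methods incur high memory cost and high-variance gradient estimates, while forward-process approaches converge faster but can suffer from distribution drift and, as a result, reward hacking. This paper presents Centered Reward Distillation (CRD), a diffusion RL framework based on KL-regularized reward maximization. The key insight is that the intractable normalizing constant cancels under within-prompt centering, yielding a well-posed reward-matching objective. The paper then introduces techniques that explicitly control distribution drift: (i) decoupling the sampler from the moving reference to prevent ratio-signal collapse, (ii) anchoring the KL term to a CFG-guided pretrained model to control long-run drift and align with the inference-time semantics of the pretrained model, and (iii) a reward-adaptive KL strength that accelerates early learning under large KL regularization while reducing late-stage exploitation of reward-model loopholes. Text-to-image post-training experiments with GenEval and OCR rewards show that CRD achieves competitive SOTA reward optimization with fast convergence and reduced reward hacking, as validated on unseen preference metrics.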
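
The cancellation mentioned in the abstract can be checked numerically on a toy problem. The sketch below is not the paper's training objective, which the abstract does not spell out; it only illustrates the standard identity for KL-regularized reward maximization, where the optimal policy satisfies p*(x|c) = p_ref(x|c) exp(r(x,c)/beta) / Z(c), so the reward equals beta * log(p*/p_ref) plus the prompt-dependent constant beta * log Z(c). Subtracting the within-prompt mean removes that constant. The discrete outcome set, the value of `beta`, the probabilities, and the rewards are all made-up assumptions for illustration.

```python
import math

def centered(values):
    """Subtract the within-prompt mean from each value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

# Hypothetical toy setup: a single prompt c with four possible outputs.
beta = 0.5                        # KL strength (assumed value)
ref_probs = [0.4, 0.3, 0.2, 0.1]  # reference model p_ref(x | c), made up
rewards = [1.0, 0.2, -0.5, 2.0]   # black-box rewards r(x, c), made up

# Optimal KL-regularized policy: p*(x|c) = p_ref(x|c) * exp(r/beta) / Z(c).
unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
Z = sum(unnorm)                   # normalizing constant for this prompt
opt_probs = [u / Z for u in unnorm]

# Implicit reward from the log-ratio; it equals r - beta * log Z(c).
implicit = [beta * math.log(q / p) for q, p in zip(opt_probs, ref_probs)]

# The gap to the true reward is the same constant for every sample.
offsets = [r - s for r, s in zip(rewards, implicit)]

# After within-prompt centering the constant is gone and both sides match.
centered_rewards = centered(rewards)
centered_implicit = centered(implicit)
max_gap = max(abs(a - b) for a, b in zip(centered_rewards, centered_implicit))
```

In this toy case `offsets` holds the same value `beta * math.log(Z)` in every position, and `max_gap` is zero up to floating-point error, which is why a centered reward-matching target never needs Z(c) to be computed.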
