Fugu-MT paper translation (abstract): Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

Paper abstract: Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.19009v1
Date: Tue, 21 Apr 2026 02:57:13 GMT
Status: translation complete
Last updated in system: 2026-04-22 22:41:49.58302
Title: Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
Title (reference translation): Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
Authors: Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou,
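Abstract summary: Diffusion distillation shows great promise for few-step generation but often sacrifices quality for sampling speed. GDMD is a new framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs. Our models surpass the quality of their multi-step teacher and substantially exceed previous DMDR results on GenEval and human-preference metrics.
Reference score (independently computed attention score): 41.982957134224904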
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
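Abstract (reference translation): Diffusion distillation, as exemplified by Distribution Matching Distillation (DMD), holds great promise for few-step generation, but it often gives up quality in exchange for sampling speed. Incorporating Reinforcement Learning (RL) into distillation is potentially beneficial, yet a naive fusion of the two objectives depends on suboptimal evaluation of raw samples. This sample-based scoring inherently conflicts with the distillation trajectory and, because early-stage generations are noisy, yields unreliable rewards. To overcome these limitations, we propose GDMD, a new framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary optimization signal. By reinterpreting the DMD gradients as implicit target tensors, existing reward models can directly evaluate the quality of distillation updates. This gradient-level guidance acts as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Experiments show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models exceed the quality of their multi-step teacher and substantially outperform previous DMDR results on GenEval and human-preference metrics, indicating strong scalability potential.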
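
The abstract describes scoring distillation updates rather than raw samples, and using that score as an adaptive weight. The toy sketch below illustrates one possible reading of that idea; it is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the form of the implicit target (`x - dmd_grad`), the stand-in `toy_reward`, and the softmax weighting with a `temperature` parameter. None of these details come from the abstract.

```python
import math

def implicit_target(x, dmd_grad):
    """ASSUMPTION: treat x - dmd_grad as the implicit target the update pulls x toward."""
    return [xi - gi for xi, gi in zip(x, dmd_grad)]

def toy_reward(sample):
    """HYPOTHETICAL stand-in for a reward model: higher when closer to the origin."""
    return -sum(v * v for v in sample)

def adaptive_weights(rewards, temperature=1.0):
    """ASSUMPTION: softmax over rewards, rescaled so the weights average to 1."""
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [len(rewards) * e / total for e in exps]

def weighted_distillation_loss(batch, grads, temperature=1.0):
    """Weight each sample's distillation loss by the reward of its implicit target."""
    targets = [implicit_target(x, g) for x, g in zip(batch, grads)]
    # Score the implicit targets, not the raw (possibly noisy) samples.
    rewards = [toy_reward(t) for t in targets]
    weights = adaptive_weights(rewards, temperature)
    # Per-sample 0.5 * ||x - target||^2, with the target treated as a constant.
    losses = [
        0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, t))
        for x, t in zip(batch, targets)
    ]
    mean_loss = sum(w * l for w, l in zip(weights, losses)) / len(batch)
    return mean_loss, weights

# Two toy "samples" with made-up gradients.
batch = [[1.0, 0.0], [0.0, 2.0]]
grads = [[0.5, 0.0], [0.0, -1.0]]
loss, weights = weighted_distillation_loss(batch, grads)
print(loss, weights)
```

In this toy run the first sample's update moves it toward the region the stand-in reward prefers, so it receives nearly all of the weight, while the second sample's update is almost ignored. A real system would replace `toy_reward` with a learned reward model and the lists with framework tensors.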

Related papers