Fugu-MT paper translation (abstract): Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

Paper overview: Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function

arxiv url: http://arxiv.org/abs/2512.04559v1
Date: Thu, 04 Dec 2025 08:21:52 GMT
Status: Translation complete
Last updated in system: 2025-12-05 21:11:46.066085
Title: Diffusion Fine-Tuning via Reparameterized Policy Gradient of the Soft Q-Function
Authors: Hyeongyu Kang, Jaewoo Lee, Woocheol Shin, Kiyoung Om, Jinkyoo Park
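Abstract summary: Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. The authors propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment. SQDF achieves superior target rewards while preserving diversity in text-to-image alignment.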
Reference score (independently computed attention score): 25.182340618001792
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models excel at generating high-likelihood samples but often require alignment with downstream objectives. Existing fine-tuning methods for diffusion models significantly suffer from reward over-optimization, resulting in high-reward but unnatural samples and degraded diversity. To mitigate over-optimization, we propose Soft Q-based Diffusion Finetuning (SQDF), a novel KL-regularized RL method for diffusion alignment that applies a reparameterized policy gradient of a training-free, differentiable estimation of the soft Q-function. SQDF is further enhanced with three innovations: a discount factor for proper credit assignment in the denoising process, the integration of consistency models to refine Q-function estimates, and the use of an off-policy replay buffer to improve mode coverage and manage the reward-diversity trade-off. Our experiments demonstrate that SQDF achieves superior target rewards while preserving diversity in text-to-image alignment. Furthermore, in online black-box optimization, SQDF attains high sample efficiency while maintaining naturalness and diversity.


Related papers