Fugu-MT Paper Translation (Abstract): Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

Paper overview: Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

arxiv url: http://arxiv.org/abs/2510.18353v1
Date: Tue, 21 Oct 2025 07:22:34 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:13.119322
Title: Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Title (reference translation): Ranking-Based Preference Optimization of Diffusion Models from Implicit User Feedback
Authors: Yi-Lun Wu, Bo-Kai Ruan, Chiang Tseng, Hong-Han Shuai
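Abstract summary: Diffusion Denoising Ranking Optimization (Diffusion-DRO) is a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem. By integrating offline expert demonstrations with online policy-generated negative samples, it captures human preferences effectively.
Reference score (independently computed attention measure): 28.40216934244641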
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies. Our source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.
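Abstract (reference translation): Direct preference optimization (DPO) methods have shown strong potential for aligning text-to-image diffusion models with human preferences through training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm, but they still struggle with two challenges: accurately estimating image probabilities, which is hampered by the non-linearity of the sigmoid function, and the limited diversity of offline datasets. This paper introduces Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, which simplifies the training objective into a denoising formulation and overcomes the non-linear estimation issues found in prior methods. In addition, Diffusion-DRO integrates offline expert demonstrations with online policy-generated negative samples, allowing it to capture human preferences effectively while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO improves generation quality across a range of challenging and unseen prompts and outperforms state-of-the-art baselines in both quantitative metrics and user studies. The source code and pre-trained models are available at https://github.com/basiclab/DiffusionDRO.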

Related papers