Fugu-MT Paper Translation (Abstract): Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

Paper overview: Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models

arxiv url: http://arxiv.org/abs/2605.26491v1
Date: Tue, 26 May 2026 03:09:24 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.596353
Title: Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
Title (reference translation): Beyond Pairwise Preferences: Listwise Reward-Aware Alignment for Diffusion Models
Authors: Austin Wang, Jiaqi Han, Stefano Ermon, Yisong Yue
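Abstract summary: Diffusion LAIR is a reward-aware listwise preference optimization method for diffusion models. Experiments show that it outperforms strong preference optimization baselines on text-to-image generation, compositional generation, and image editing benchmarks.
Reference score (independently computed attention score): 73.08789211016567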
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. However, existing methods largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, we propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts reward scores across a group of candidate images into centered advantage weights, then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and remains conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, clarifying how the regularization strength controls the magnitude of the preference update. Experiments show that Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.
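Abstract (reference translation): Preference optimization has emerged as an efficient alternative to online reinforcement learning from human feedback (RLHF) for aligning text-to-image diffusion models. Existing methods, however, largely reduce supervision to binary pairwise comparisons. This pairwise reduction is limiting when the training data naturally contains multiple candidate images for the same prompt, and when continuous reward scores can provide richer information than a single winner-loser label. To address these limitations, the authors propose Diffusion LAIR, a reward-aware listwise preference optimization method for diffusion models. For each prompt, LAIR converts the reward scores across a group of candidate images into centered advantage weights. It then optimizes an advantage-weighted regression objective on the implicit reward, defined as the denoising-loss improvement of the current model over a fixed reference model, with a quadratic penalty that regularizes the magnitude of the implicit reward. The resulting objective uses all candidates simultaneously rather than selecting pairs, and stays conservative by explicitly controlling the magnitude of the implicit reward. The LAIR objective admits a bounded closed-form optimum in implicit-reward space, which clarifies how the regularization strength controls the magnitude of the preference update. In experiments, Diffusion LAIR outperforms strong preference optimization baselines on SD1.5 and SDXL across text-to-image generation, compositional generation, and image editing benchmarks.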
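
The abstract describes the objective only in words, so the following is a minimal sketch of one plausible reading, not the paper's actual formula. It assumes centered advantages A_i = r_i - mean(r), an implicit reward s_i = (reference denoising loss) - (current-model denoising loss), and a per-group loss of the form mean(-A_i * s_i + (lam / 2) * s_i^2). The function names, the parameter `lam`, and the absence of any timestep weighting or temperature scaling are all assumptions made for illustration.

```python
from typing import List


def centered_advantages(rewards: List[float]) -> List[float]:
    """Subtract the group mean so the advantage weights sum to zero."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def implicit_rewards(loss_ref: List[float], loss_model: List[float]) -> List[float]:
    """Denoising-loss improvement of the current model over the reference.

    Positive when the current model denoises a candidate better than the
    fixed reference model does.
    """
    return [lr - lm for lr, lm in zip(loss_ref, loss_model)]


def listwise_loss(rewards: List[float],
                  loss_ref: List[float],
                  loss_model: List[float],
                  lam: float = 1.0) -> float:
    """Assumed objective: advantage-weighted term plus a quadratic penalty.

    Every candidate in the group contributes; no pairs are selected.
    """
    adv = centered_advantages(rewards)
    s = implicit_rewards(loss_ref, loss_model)
    terms = [-a * si + 0.5 * lam * si * si for a, si in zip(adv, s)]
    return sum(terms) / len(terms)


def closed_form_optimum(rewards: List[float], lam: float = 1.0) -> List[float]:
    """Minimizer of the assumed loss in implicit-reward space: s_i = A_i / lam."""
    return [a / lam for a in centered_advantages(rewards)]


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]   # hypothetical reward scores for one prompt
    loss_ref = [0.5, 0.5, 0.5]  # hypothetical reference denoising losses
    target = closed_form_optimum(rewards, lam=2.0)
    # Current-model losses that realize the optimal implicit rewards.
    loss_model = [lr - s for lr, s in zip(loss_ref, target)]
    print(target)
    print(listwise_loss(rewards, loss_ref, loss_model, lam=2.0))
```

Under this assumed form, the optimum scales as 1 / `lam` and is bounded whenever the rewards are bounded, which is consistent with the abstract's remark that the regularization strength controls the magnitude of the preference update.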

Related papers