Fugu-MT Paper Translation (Abstract): Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Paper abstract: Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

arxiv url: http://arxiv.org/abs/2605.06070v1
Date: Thu, 07 May 2026 11:56:26 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.740525
Title: Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
Title (reference translation): Arena as an Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models
Authors: Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang, Qingyi Gu, Zhen Dong
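Abstract summary: This paper proposes ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback. Because the feedback requires no reward model and can be computed offline, it introduces no additional training overhead. The authors train ArenaPO on the Pick-a-Pic v2 and HPD v3 datasets and show that it consistently outperforms existing baselines.
Reference score (independently computed attention score): 26.065952775368768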
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for an image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on a truncated normal distribution, and this estimate serves as fine-grained feedback during training. The feedback does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on the Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
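Abstract (reference translation): Reinforcement learning from human feedback (RLHF) effectively promotes the preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback restricts it to coarse-grained modeling of chosen-rejected pairs, which leads to suboptimal optimization. This paper proposes ArenaPO, which uses Arena scores as offline rewards to provide refined feedback and thereby achieves efficient, fine-grained optimization without a reward model. As a result, ArenaPO benefits from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, the authors first build a model Arena in which each model's capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for an image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated by latent-variable inference based on a truncated normal distribution, and this estimate serves as fine-grained feedback during training. Because the feedback requires no reward model and can be computed offline, it introduces no additional training overhead. ArenaPO is trained on the Pick-a-Pic v2 and HPD v3 datasets and consistently outperforms existing baselines.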
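
The abstract gives no formulas, so the following is only a minimal sketch of one way a truncated-normal gap estimate could be computed, not the authors' implementation. It assumes that the two image qualities are independent draws from their models' Gaussian capability distributions, so their difference d is Gaussian with mean mu_w - mu_l and variance sigma_w^2 + sigma_l^2, and that observing the preference truncates d to d > 0. The function name `expected_quality_gap` and all parameter names are hypothetical.

```python
import math


def normal_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with erfc for accuracy in the left tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def expected_quality_gap(mu_w: float, sigma_w: float,
                         mu_l: float, sigma_l: float) -> float:
    """Posterior mean of the quality gap d = q_w - q_l given that w was preferred.

    Assumption (not from the paper): q_w ~ N(mu_w, sigma_w^2) and
    q_l ~ N(mu_l, sigma_l^2) are independent, so d ~ N(mu_d, sigma_d^2),
    and the observed preference truncates d to the region d > 0.
    """
    mu_d = mu_w - mu_l
    sigma_d = math.sqrt(sigma_w ** 2 + sigma_l ** 2)
    if sigma_d <= 0.0:
        raise ValueError("at least one standard deviation must be positive")

    z = mu_d / sigma_d
    cdf = normal_cdf(z)
    if cdf == 0.0:
        # Extreme upset: the CDF underflows, so fall back to the leading
        # term of the asymptotic expansion, E[d | d > 0] ~ sigma_d / |z|.
        return sigma_d / -z

    # Mean of a normal truncated below at 0:
    # E[d | d > 0] = mu_d + sigma_d * pdf(z) / cdf(z)
    return mu_d + sigma_d * normal_pdf(z) / cdf


if __name__ == "__main__":
    # Evenly matched models, a clear favourite winning, and an upset.
    print(expected_quality_gap(0.0, 1.0, 0.0, 1.0))
    print(expected_quality_gap(10.0, 1.0, 0.0, 1.0))
    print(expected_quality_gap(0.0, 1.0, 10.0, 1.0))
```

Under these assumptions the estimate is always positive. It stays close to the difference in capability means when the stronger model wins, and shrinks toward zero when a much weaker model wins. Since it depends only on the capability distributions and the recorded preference, it can be precomputed for every pair before training.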
