Fugu-MT Paper Translation (Abstract): HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

Paper summary: HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities

arxiv url: http://arxiv.org/abs/2606.14657v1
Date: Fri, 12 Jun 2026 17:22:04 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:43.010308
Title: HPSv3++: Scaling Reward Models Across the Full Spectrum of Diffusion Model Capabilities
Title (reference translation): HPSv3++: Scaling reward models across the full spectrum of diffusion model capabilities
Authors: Yijun Liu, Jie Huang, Zeyue Xue, Yuming Li, Ruizhe He, Haoran Li, Shijia Ge, Siming Fu,
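Abstract summary: We propose HPSv3++, a reward model framework that improves the HPSv3 model across varying T2I model capabilities and RL iterations. HPSv3++ outperforms HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, and achieves 79.1%/88.1% on the proposed HPDv3++.
Reference score (independently computed attention score): 21.055150092997902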
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward models guide text-to-image (T2I) systems toward outputs aligned with human preferences. However, typical reward models such as HPSv3 are trained on pre-annotated data from earlier T2I models, without accounting for quality discriminative shifts arising from evolving model capabilities and reinforcement learning (RL) iterations, limiting their broader applicability. In this work, we propose HPSv3++, a reward model framework that elevates the HPSv3 model for varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, we first introduce HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability (Qwen-Image) model with human supervision. We then propose a two-stage training framework. Stage 1 employs data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the original effective human preference knowledge in HPSv3. Stage 2 further leverages unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ achieves state-of-the-art preference prediction, outperforming HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, while achieving 79.1%/88.1% on our proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across diverse T2I models, demonstrating its wide-ranging capabilities. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.
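Abstract (reference translation): Reward models guide text-to-image (T2I) systems toward outputs that match human preferences. However, typical reward models such as HPSv3 are trained on annotated data from early T2I models, without considering the shifts in quality discrimination caused by evolving model capabilities and reinforcement learning (RL) iterations. This work therefore proposes HPSv3++, which improves the HPSv3 model for varying T2I model capabilities and their RL iteration changes across the full capability-iteration spectrum. Specifically, it introduces HPDv3++, a 212K dual-dimension preference dataset annotated for text fidelity and aesthetic quality using a recent high-capability model (Qwen-Image) with human supervision. It then proposes a two-stage training framework. Stage 1 uses data-aware orthogonal gradient projection to incorporate diverse aesthetic perception from HPDv3++ while preserving the effective human preference knowledge already in HPSv3. Stage 2 additionally uses unlabeled data from T2I models spanning different capability levels and RL iterations, and introduces a joint capability-iteration conditioned signal for the reward model together with a standard-deviation-driven unsupervised guidance mechanism, strengthening the reward model across the capability-iteration spectrum. HPSv3++ outperforms HPSv3 by 9.8% on HPDv3 and by 5.5% on GenAI-Bench, and achieves 79.1%/88.1% on the proposed HPDv3++. When used for T2I RL training, it consistently improves GenEval scores across a variety of T2I models, demonstrating its wide-ranging capability. The code is available at https://github.com/PlantPotatoOnMoon/HPSv3-PlusPlus.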
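
Note on Stage 1: the abstract names orthogonal gradient projection but gives no formulation, and it does not say what makes the projection "data-aware". The sketch below shows only the generic idea, under the assumption that the new-data gradient is projected so that it has no component along a single reference direction tied to the existing knowledge. The function names and the toy vectors are hypothetical and are not taken from the HPSv3++ code.

```python
# Generic orthogonal gradient projection on plain Python lists.
# Illustrative only: this is NOT the HPSv3++ implementation, and the
# "data-aware" part of the paper's method is not modeled here.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(grad_new, grad_ref, eps=1e-12):
    """Remove from grad_new its component along grad_ref.

    grad_new: gradient computed on the new data (hypothetical).
    grad_ref: reference direction to protect (hypothetical), e.g. a
              gradient computed on data representing the old knowledge.
    The result is orthogonal to grad_ref, so a small step along it
    leaves the reference objective unchanged to first order.
    """
    denom = dot(grad_ref, grad_ref)
    if denom < eps:
        # No meaningful reference direction: return the gradient unchanged.
        return list(grad_new)
    scale = dot(grad_new, grad_ref) / denom
    return [g - scale * r for g, r in zip(grad_new, grad_ref)]

# Toy example with made-up numbers.
grad_ref = [1.0, 0.0, 0.0]
grad_new = [0.5, 2.0, -1.0]
projected = project_orthogonal(grad_new, grad_ref)
print(projected)                 # [0.0, 2.0, -1.0]
print(dot(projected, grad_ref))  # 0.0
```

In practice such a projection would be applied per parameter tensor, and often against a subspace of several reference directions rather than one vector. Which variant the paper uses is not stated in the abstract.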

Related papers