Fugu-MT Paper Translation (Abstract): Visual Preference Optimization with Rubric Rewards

Paper overview: Visual Preference Optimization with Rubric Rewards

arxiv url: http://arxiv.org/abs/2604.13029v1
Date: Tue, 14 Apr 2026 17:58:22 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.606547
Title: Visual Preference Optimization with Rubric Rewards
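Title (reference translation): Visual Preference Optimization with Rubric Rewards
Authors: Ya-Qi Yu, Fangyu Hong, Xiangyang Qu, Hao Wang, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu
Abstract summary: This paper proposes rDPO, a preference optimization framework based on instance-specific rubrics. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. When evaluating scalability on a comprehensive benchmark, rDPO reaches 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model (59.48).
Reference score (independently computed attention score): 30.826907231502663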
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria that can score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average from 81.14 to 82.69, whereas outcome-based filtering lowers it to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model (59.48). Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
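Abstract (reference translation): The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. This paper proposes rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, a checklist-style rubric of essential and additional criteria is created to score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering lowers it from 81.14 to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO reaches 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model (59.48). These results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.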
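
The abstract describes scoring responses against a checklist-style rubric of essential and additional criteria and using those scores to build on-policy preference data. The following is a minimal sketch of that idea, not the authors' implementation. The `Rubric` structure, the `essential_weight` of 2.0, the weighted-fraction scoring rule, the best-versus-worst pairing, and the `min_margin` filter are all illustrative assumptions; the abstract does not specify how criterion judgments are aggregated or how pairs are selected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rubric:
    """Checklist for one image-instruction pair (hypothetical structure)."""
    essential: tuple[str, ...]   # criteria a good response is expected to meet
    additional: tuple[str, ...]  # extra criteria that further separate responses


def rubric_score(rubric: Rubric, passed: set[str],
                 essential_weight: float = 2.0) -> float:
    """Weighted fraction of satisfied criteria, in [0, 1].

    `passed` holds the criteria a judge marked as satisfied.
    The weighting scheme is an assumption made for this sketch.
    """
    total = essential_weight * len(rubric.essential) + len(rubric.additional)
    if total == 0:
        return 0.0
    got = essential_weight * sum(c in passed for c in rubric.essential)
    got += sum(c in passed for c in rubric.additional)
    return got / total


def build_preference_pair(rubric: Rubric, judged: dict[str, set[str]],
                          min_margin: float = 0.2) -> Optional[tuple[str, str]]:
    """Pick (chosen, rejected) from sampled responses, or None.

    `judged` maps each sampled response to its satisfied criteria.
    Pairs whose score gap is below `min_margin` are dropped (assumed filter).
    """
    if len(judged) < 2:
        return None
    scores = {resp: rubric_score(rubric, ok) for resp, ok in judged.items()}
    chosen = max(scores, key=scores.get)
    rejected = min(scores, key=scores.get)
    if scores[chosen] - scores[rejected] < min_margin:
        return None
    return chosen, rejected


# Usage with made-up criteria and responses.
example_rubric = Rubric(
    essential=("names the chart type", "reads the peak value"),
    additional=("mentions units",),
)
example_judged = {
    "response A": {"names the chart type", "reads the peak value", "mentions units"},
    "response B": {"mentions units"},
}
example_pair = build_preference_pair(example_rubric, example_judged)
```

A pair returned this way has the (prompt, chosen, rejected) shape that DPO-style training consumes. In practice the per-criterion judgments would come from a judge model prompted with the rubric, which this sketch replaces with hand-written sets.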

Related papers