Fugu-MT paper translation (abstract): You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

Paper abstract: You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

arxiv url: http://arxiv.org/abs/2604.10966v2
Date: Wed, 15 Apr 2026 22:13:12 GMT
Status: Translation complete
Last updated in system: 2026-04-17 16:09:14.151896
Title: You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
Title (reference translation): Judge Only Once: Multi-response Reward Modeling in a Single Forward Pass
Authors: Yinuo Yang, Zixian Ma, Manasi Ganti, Jieyu Zhang, Ranjay Krishna
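Abstract summary: We propose a discriminative multimodal reward model that scores all candidate responses in a single forward pass. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. Our model outperforms existing larger generative and discriminative reward models.
Reference score (independently computed attention score): 40.11359880802771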
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.
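Abstract (reference translation): We propose a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, so they need one forward pass per response. The proposed method concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, two new benchmarks are constructed: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question answering spanning 19 models, denoised via preference graph ensemble. Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, the model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four existing benchmarks. It outperforms existing larger generative and discriminative reward models. Furthermore, when used in reinforcement learning with GRPO, the reward model produces policy models that maintain performance on standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline in both training stability and open-ended generation quality.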
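
The two quantitative ideas in the abstract, cross-entropy over the scalar scores of $N$ candidates and the "up to $N\times$" saving, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of plain floats in place of value-head outputs, the single preferred index as the training target, the count of $N-1$ separator tokens, and the use of processed-token counts as a rough proxy for compute are all assumptions made for illustration.

```python
import math


def n_way_cross_entropy(scores, best_index):
    """Cross-entropy loss that treats N scalar scores as softmax logits.

    `scores` stands in for the per-response outputs of a value head
    (hypothetical; here they are plain floats). `best_index` is the
    position of the preferred response (assumed single-label target).
    """
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[best_index]


def token_cost_ratio(prompt_tokens, response_tokens, n):
    """Tokens processed by N separate passes divided by tokens in one joint pass.

    Assumes every response has the same length and that the joint input
    uses n - 1 separator tokens. Token count is only a crude proxy for
    FLOPs or wall-clock time.
    """
    separate = n * (prompt_tokens + response_tokens)
    joint = prompt_tokens + n * response_tokens + (n - 1)
    return separate / joint


if __name__ == "__main__":
    # The loss is lower when the preferred response has the highest score.
    print(n_way_cross_entropy([2.0, 0.0, 0.0, 0.0], best_index=0))
    print(n_way_cross_entropy([2.0, 0.0, 0.0, 0.0], best_index=1))
    # A long shared prompt (e.g. image or video tokens) pushes the ratio toward N.
    print(token_cost_ratio(prompt_tokens=1000, response_tokens=10, n=4))
```

Under these assumptions the ratio approaches $N$ only when the shared prompt dominates the input length, which is consistent with the abstract's wording of "up to" $N\times$.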

Related papers