Fugu-MT Paper Translation (Abstract): VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

Paper overview: VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding

arxiv url: http://arxiv.org/abs/2509.00484v1
Date: Sat, 30 Aug 2025 12:50:55 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.254724
Title: VideoRewardBench: Comprehensive Evaluation of Multimodal Reward Models for Video Understanding
Authors: Zhihong Zhang, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xinzhi Wang, Jiansheng Wei, Xuejin Chen
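Abstract summary: Multimodal reward models (MRMs) play an important role in the training, inference, and evaluation of Large Vision Language Models (LVLMs). Existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions. We introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding.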
Reference score (independently calculated attention score): 19.54215281137561
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal reward models (MRMs) play a crucial role in the training, inference, and evaluation of Large Vision Language Models (LVLMs) by assessing response quality. However, existing benchmarks for evaluating MRMs in the video domain suffer from a limited number and diversity of questions, a lack of comprehensive evaluation dimensions, and inadequate evaluation of diverse types of MRMs. To address these gaps, we introduce VideoRewardBench, the first comprehensive benchmark covering four core aspects of video understanding: perception, knowledge, reasoning, and safety. Through our AI-assisted data pipeline, we curate a high-quality preference dataset of 1,563 annotated samples, including 1,482 unique videos and 1,559 distinct questions--15 times the number found in the most question-rich prior benchmark. Each sample is a triplet consisting of a video-text prompt, a chosen response, and a rejected response. We also conduct a comprehensive evaluation across 28 multimodal reward models spanning three categories: generative, discriminative, and semi-scalar. Results show that even the top-performing model GPT-4o achieves only 57.0% overall accuracy, and the state-of-the-art open-source model Qwen2.5-VL-72B reaches merely 53.3%. Our analysis further reveals three key insights: (i) MRMs trained with reinforcement learning (RL) do not necessarily exhibit stronger cross-modal generalization than those trained without RL; (ii) except for discriminative MRMs, other types of MRMs across varying model capacities can benefit from inference-time scaling; and (iii) variations in input video frame count have different effects on different types of MRMs. We believe VideoRewardBench offers a challenging and valuable benchmark for advancing the evaluation and development of MRMs in the video domain.
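The abstract describes each sample as a triplet (video-text prompt, chosen response, rejected response) and reports results as accuracy. The following is a minimal sketch of how accuracy over such triplets could be computed, not the authors' evaluation code. The names `PreferenceSample`, `pairwise_accuracy`, and `length_scorer`, the field names, and the rule that ties count as errors are all assumptions made for illustration; the toy scorer stands in for a real reward model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferenceSample:
    """One benchmark item: a video-text prompt plus two candidate responses.

    Field names are hypothetical; the paper only specifies the triplet structure.
    """
    video_path: str
    question: str
    chosen: str
    rejected: str


# A scorer maps (video_path, question, response) to a scalar quality score.
Scorer = Callable[[str, str, str], float]


def pairwise_accuracy(samples: Sequence[PreferenceSample], score: Scorer) -> float:
    """Fraction of samples where the chosen response outscores the rejected one.

    Ties count as errors here; that is an assumption, not a rule from the paper.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    correct = sum(
        score(s.video_path, s.question, s.chosen)
        > score(s.video_path, s.question, s.rejected)
        for s in samples
    )
    return correct / len(samples)


def length_scorer(video_path: str, question: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


toy_samples = [
    PreferenceSample("a.mp4", "What happens?", "A dog catches a ball.", "A dog."),
    PreferenceSample("b.mp4", "How many cars?", "Two.", "There are seven cars."),
]
toy_accuracy = pairwise_accuracy(toy_samples, length_scorer)
print(toy_accuracy)
```

A scalar scorer of this kind corresponds most directly to a discriminative reward model. A generative MRM that compares two responses in a single prompt would need a different interface, for example a function that returns the index of the preferred response.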

Related papers