Fugu-MT Paper Translation (Summary): VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Paper summary: VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

arxiv url: http://arxiv.org/abs/2509.21451v1
Date: Thu, 25 Sep 2025 19:22:57 GMT
Status: Translation complete
System update date: 2025-09-29 20:57:53.934436
Title: VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Title (reference translation): VideoJudge: Bootstrapping to Realize Scalable Supervision of MLLM-as-a-Judge for Video Understanding
Authors: Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj,
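Abstract summary: We introduce VideoJudge, 3B- and 7B-sized MLLM judges for evaluating outputs from video understanding models. To train VideoJudge, the recipe builds on the interplay between a generator and an evaluator. On three of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines.
Reference score (independently computed attention metric): 57.15309719147799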
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, a 3B and 7B-sized MLLM judge specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3 models) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluation of video understanding tasks.
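Abstract (reference translation): Common metrics such as BLEU, ROUGE, and BERTScore fail to capture the subtlety of human judgment, while manual judgment is costly. Recent work has explored large language models (LLMs) and multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains largely unexplored. We introduce VideoJudge, 3B- and 7B-sized MLLM judges for evaluating the outputs of video understanding models. To train VideoJudge, the recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses that do not match the evaluator's rating are discarded. On three of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL), and long chain-of-thought reasoning does not improve performance, indicating that providing video input is essential for evaluating video understanding tasks.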
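
The generator-evaluator filtering described in the abstract could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `bootstrap_filter`, `Candidate`, `toy_generate`, and `toy_evaluate` are hypothetical, the 1 to 5 rating scale is an assumption, and "matching" is read here as exact equality between the target rating and the evaluator's rating. The toy stand-ins replace real model calls so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Candidate:
    """One retained training example (hypothetical structure)."""
    video_id: str
    instruction: str
    response: str
    target_rating: int


def bootstrap_filter(
    items: Iterable[Tuple[str, str]],
    generate: Callable[[str, str, int], str],
    evaluate: Callable[[str, str, str], int],
    ratings: Iterable[int] = range(1, 6),  # assumed 1-5 scale
) -> List[Candidate]:
    """Keep only responses whose evaluator rating equals the target rating."""
    rating_list = list(ratings)
    kept: List[Candidate] = []
    for video_id, instruction in items:
        for target in rating_list:
            # Generator is prompted to produce a response of the target quality.
            response = generate(video_id, instruction, target)
            # Evaluator rates the response independently of the target.
            if evaluate(video_id, instruction, response) == target:
                kept.append(Candidate(video_id, instruction, response, target))
            # Otherwise the response is discarded.
    return kept


# Toy stand-ins for the two models (assumptions, for illustration only).
def toy_generate(video_id: str, instruction: str, target: int) -> str:
    return f"[quality={target}] answer to '{instruction}' for {video_id}"


def toy_evaluate(video_id: str, instruction: str, response: str) -> int:
    rating = int(response.split("quality=")[1][0])
    # Simulate a disagreement: the evaluator rates target-3 responses as 4.
    return 4 if rating == 3 else rating


kept = bootstrap_filter([("vid_001", "Describe the video.")], toy_generate, toy_evaluate)
print([c.target_rating for c in kept])  # [1, 2, 4, 5]
```

In this toy run the response generated for target rating 3 is dropped because the evaluator assigns it a 4, while the other four responses are retained as rated training examples.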
