Fugu-MT Paper Translation (Abstract): VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

Paper overview: VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification

arxiv url: http://arxiv.org/abs/2604.01569v1
Date: Thu, 02 Apr 2026 03:29:43 GMT
Status: translation complete
Last updated in system: 2026-04-03 14:21:10.208421
Title: VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
Title (reference translation): VideoZeroBench: Examining the Limits of Video MLLMs through Spatio-Temporal Evidence Verification
Authors: Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan
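Abstract summary: VideoZeroBench is a hierarchical benchmark for long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. Even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting. The results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning.
Reference score (independently computed attention measure): 73.02304272829785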
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answering generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
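Abstract (reference translation): Recent video multimodal large language models achieve impressive results on a variety of benchmarks. However, current evaluations suffer from two limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answer generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). No model exceeds 1% when both a correct answer and accurate spatio-temporal localization are required (Level-5). These results reveal a significant gap between surface-level answer correctness and genuine evidence-based reasoning. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.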
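
The abstract states that Level-5 counts a prediction as correct only when the answer is right and the spatio-temporal localization is accurate, but it does not say how localization accuracy is measured. The sketch below illustrates one plausible form of such a check. The use of intersection-over-union (IoU), the 0.5 thresholds, the exact-match answer comparison, and all names (`Evidence`, `temporal_iou`, `box_iou`, `grounded_correct`) are assumptions made for illustration, not the benchmark's actual protocol.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Evidence:
    """Hypothetical evidence record: a time interval plus one bounding box."""
    start: float  # interval start, in seconds
    end: float    # interval end, in seconds
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def temporal_iou(a: Evidence, b: Evidence) -> float:
    """IoU of two time intervals; 0.0 when they do not overlap."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Tuple[float, float, float, float],
            b: Tuple[float, float, float, float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounded_correct(pred_answer: str, gold_answer: str,
                     pred: Evidence, gold: Evidence,
                     t_thresh: float = 0.5,   # assumed threshold
                     s_thresh: float = 0.5) -> bool:  # assumed threshold
    """True only if the answer matches AND both groundings pass their thresholds."""
    answer_ok = pred_answer.strip().lower() == gold_answer.strip().lower()
    return (answer_ok
            and temporal_iou(pred, gold) >= t_thresh
            and box_iou(pred.box, gold.box) >= s_thresh)
```

Under a conjunctive check of this kind, a model that answers correctly but points at the wrong moment or region still scores zero. That is consistent with the reported drop from under 17% at Level-3 to at most 1% at Level-5.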

Related papers