Fugu-MT paper translation (abstract): MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding

Paper summary: MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding

arxiv url: http://arxiv.org/abs/2603.22756v1
Date: Tue, 24 Mar 2026 03:33:07 GMT
Status: translation complete
Last updated in system: 2026-03-25 19:53:37.280908
Title: MVPBench: A Multi-Video Perception Evaluation Benchmark for Multi-Modal Video Understanding
Authors: Purui Bai, Tao Wu, Jiayang Sun, Xinyue Liu, Huaibo Huang, Ran He
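Abstract summary: Existing benchmarks are limited to static images or single videos and overlook the complex interactions across multiple videos. MVPBench is a new benchmark with 14 subtasks, designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips.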
Reference score (independently computed attention score): 36.60861786811499
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid progress of Large Language Models (LLMs) has spurred growing interest in Multi-modal LLMs (MLLMs) and motivated the development of benchmarks to evaluate their perceptual and comprehension abilities. Existing benchmarks, however, are limited to static images or single videos, overlooking the complex interactions across multiple videos. To address this gap, we introduce the Multi-Video Perception Evaluation Benchmark (MVPBench), a new benchmark featuring 14 subtasks across diverse visual domains designed to evaluate models on extracting relevant information from video sequences to make informed decisions. MVPBench includes 5K question-answering tests involving 2.7K video clips sourced from existing datasets and manually annotated clips. Extensive evaluations reveal that current models struggle to process multi-video inputs effectively, underscoring substantial limitations in their multi-video comprehension. We anticipate MVPBench will drive advancements in multi-video perception.


Related papers