Fugu-MT Paper Translation (Abstract): VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding

Paper overview: VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding

arxiv url: http://arxiv.org/abs/2601.03434v1
Date: Tue, 06 Jan 2026 21:42:44 GMT
Status: Translation complete
Last updated in system: 2026-01-09 02:15:23.079476
Title: VNU-Bench: A Benchmarking Dataset for Multi-Source Multimodal News Video Understanding
Title (reference translation): VNU-Bench: A Benchmark Dataset for Multi-Source Multimodal News Video Understanding
Authors: Zibo Liu, Muyang Li, Zhe Jiang, Shigang Chen
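Abstract summary: This paper introduces VNU-Bench, the first benchmark for multi-source, cross-video understanding in the news domain. The authors design a set of new question types that uniquely test models' ability to understand multi-source multimodal news from a variety of angles. The dataset comprises 429 news groups, 1,405 videos, and 2,501 high-quality questions.
Reference score (independently computed attention score): 15.757734298648634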
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: News videos are carefully edited multimodal narratives that combine narration, visuals, and external quotations into coherent storylines. In recent years, there have been significant advances in evaluating multimodal large language models (MLLMs) for news video understanding. However, existing benchmarks largely focus on single-source, intra-video reasoning, where each report is processed in isolation. In contrast, real-world news consumption is inherently multi-sourced: the same event is reported by different outlets with complementary details, distinct narrative choices, and sometimes conflicting claims that unfold over time. Robust news understanding, therefore, requires models to compare perspectives from different sources, align multimodal evidence across sources, and synthesize multi-source information. To fill this gap, we introduce VNU-Bench, the first benchmark for multi-source, cross-video understanding in the news domain. We design a set of new question types that are unique in testing models' ability of understanding multi-source multimodal news from a variety of different angles. We design a novel hybrid human-model QA generation process that addresses the issues of scalability and quality control in building a large dataset for cross-source news understanding. The dataset comprises 429 news groups, 1,405 videos, and 2,501 high-quality questions. Comprehensive evaluation of both closed- and open-source multimodal models shows that VNU-Bench poses substantial challenges for current MLLMs.
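Abstract (reference translation): News videos are carefully edited multimodal narratives that combine narration, visuals, and external quotations into coherent storylines. In recent years, the evaluation of multimodal large language models (MLLMs) for news video understanding has advanced significantly. However, existing benchmarks focus mainly on single-source, intra-video reasoning, in which each report is processed in isolation. In contrast, real-world news consumption is inherently multi-source: the same event is reported by different outlets with complementary details, distinct narrative choices, and sometimes conflicting claims that unfold over time. Robust news understanding therefore requires models to compare perspectives from different sources, align multimodal evidence across sources, and synthesize multi-source information. To fill this gap, we introduce VNU-Bench, the first benchmark for multi-source, cross-video understanding in the news domain. We design a set of new question types that uniquely test models' ability to understand multi-source multimodal news from a variety of angles. We also design a novel hybrid human-model QA generation process that addresses the scalability and quality-control issues of building a large dataset for cross-source news understanding. The dataset comprises 429 news groups, 1,405 videos, and 2,501 high-quality questions. Comprehensive evaluation of both closed- and open-source multimodal models shows that VNU-Bench poses substantial challenges for current MLLMs.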

Related papers