Fugu-MT Paper Translation (Summary): OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

Paper summary: OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs

arxiv url: http://arxiv.org/abs/2510.10689v1
Date: Sun, 12 Oct 2025 16:34:00 GMT
Status: Translation complete
System update date: 2025-10-14 18:06:30.066207
Title: OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Authors: Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Jiafu Tang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu
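Abstract summary: We introduce OmniVideoBench, a benchmark for evaluating synergistic audio-visual understanding. OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.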
Reference score (independently computed attention score): 72.425061028374
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated substantial potential in video understanding. However, existing benchmarks fail to comprehensively evaluate synergistic reasoning capabilities across audio and visual modalities, often neglecting one of the modalities or integrating them in a logically inconsistent manner. To bridge this gap, we introduce OmniVideoBench, a large-scale and rigorously designed benchmark dedicated to assessing synergistic audio-visual understanding, with a strong emphasis on modality complementarity and logical consistency. Specifically, OmniVideoBench comprises 1000 high-quality question-answer (QA) pairs, each annotated with step-by-step reasoning traces, derived from 628 diverse videos ranging from several seconds to 30 minutes, and manually verified to guarantee complete correctness and uniqueness. Moreover, OmniVideoBench encompasses 13 carefully designed question types, covering temporal reasoning, spatial localization, counting, causal inference, summarization, and beyond, thereby capturing the essential challenges of video understanding. Evaluation of multiple MLLMs on OmniVideoBench reveals a pronounced gap between model performance and human reasoning, with open-source models lagging significantly behind their closed-source counterparts, underscoring the inherent difficulty of genuine audio-visual reasoning. We will release OmniVideoBench to foster the development of MLLMs with stronger and more generalizable reasoning capabilities.

List of related papers