Fugu-MT Paper Translation (Abstract): MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Paper summary: MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

arxiv url: http://arxiv.org/abs/2603.14145v1
Date: Sat, 14 Mar 2026 22:28:38 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.631358
Title: MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos
Title (reference translation): MMOU: A Large-Scale Multi-Task Benchmark for Omni-Modal Understanding and Reasoning over Long, Complex Real-World Videos
Authors: Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping
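Abstract summary: MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU.
Reference score (independently computed attention score): 118.61621763485465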
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.
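Abstract (reference translation): Multimodal Large Language Models (MLLMs) perform strongly on visual and audio understanding when each is evaluated in isolation. However, their ability to reason jointly over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. MMOU is a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. It consists of 15,000 carefully curated questions paired with 9038 web-collected videos that span diverse domains and show rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated over multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. These results show that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.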
