Fugu-MT Paper Translation (Abstract): LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Paper overview: LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

arxiv url: http://arxiv.org/abs/2603.19217v1
Date: Thu, 19 Mar 2026 17:58:13 GMT
Status: Translation complete
System update date: 2026-03-20 17:19:06.323672
Title: LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Title (reference translation): LVOmniBench: Long Audio-Video Understanding Evaluation for Omnimodal LLMs
Authors: Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang
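Abstract summary: This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. We examine the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%.
Reference score (independently computed attention score): 68.35684758116453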
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.
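Abstract (reference translation): Recent advances in omnimodal large language models (OmniLLMs) have markedly improved the understanding of audio and video inputs. However, current evaluations focus mainly on short audio and video clips of 10 seconds to 5 minutes and do not reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for cross-modal understanding of long-form audio and video. The dataset consists of high-quality videos, sourced from open platforms, that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos ranging from 10 to 90 minutes in length and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation shows that current OmniLLMs face significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas Gemini 3 Pro reaches a peak accuracy of approximately 65%. We expect that this dataset, together with our empirical findings, will stimulate further research and the development of advanced models capable of solving complex cross-modal understanding problems in long-form audio-visual contexts.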

Related papers