Fugu-MT Paper Translation (Abstract): AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

Paper overview: AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

arxiv url: http://arxiv.org/abs/2606.07643v1
Date: Mon, 01 Jun 2026 19:12:09 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.179298
Title: AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs
Authors: Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding, Yunxin Liu
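Abstract summary: We introduce a benchmark that evaluates Omni-MLLMs across three stages: perception, understanding, and reasoning. AVI-Bench enables fine-grained diagnosis of model capabilities and failure modes. AVI-Bench-PriSe probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli.
Reference score (independently computed attention score): 64.22272455664884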
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Omni-Multimodal Large Language Models (Omni-MLLMs) have enabled strong integration of vision, audio, and language. However, their audio-visual intelligence (AVI) remains insufficiently evaluated due to the lack of systematic and comprehensive benchmarks. We introduce AVI-Bench, a cognitively inspired benchmark that evaluates Omni-MLLMs across three stages, perception, understanding, and reasoning, through cross-modal tasks requiring joint audio-visual interpretation. This design enables fine-grained diagnosis of model capabilities and failure modes. To further assess robustness beyond familiar domains, we propose AVI-Bench-PriSe, an extension that probes models' primitive audio-visual sensation using unfamiliar, low-semantic stimuli, testing generalization beyond common training distributions. Extensive experiments on both open-source and closed-source models reveal substantial limitations in current Omni-MLLMs. Based on these findings, we present a four-level AVI taxonomy. Overall, AVI-Bench provides a principled evaluation framework to guide the development of more robust and generalizable AVI. Project website: https://fudancvl.github.io/AVI-Bench/
