Fugu-MT Paper Translation (Abstract): MIBench: Evaluating LMMs on Multimodal Interaction

Paper overview: MIBench: Evaluating LMMs on Multimodal Interaction

arxiv url: http://arxiv.org/abs/2603.13427v1
Date: Fri, 13 Mar 2026 03:02:24 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.193863
Title: MIBench: Evaluating LMMs on Multimodal Interaction
Title (reference translation): MIBench: Evaluating LMMs on Multimodal Interaction
Authors: Yu Miao, Zequn Yang, Yake Wei, Ziheng Chen, Haotian Ni, Haodong Duan, Kai Chen, Di Hu
Abstract summary: MIBench is a benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). It comprises over 10,000 vision-text context pairs spanning 32 distinct tasks.
Reference score (independently computed attention score): 44.761361565906924
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Different multimodal scenarios require a model to integrate and use information across modalities in a specific way, depending on the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). It formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, so that LMMs must employ the correct form of multimodal interaction to complete the task. MIBench assesses models on three key aspects: the ability to source information from vision-centric cues, the ability to source it from text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs shows that: (1) LMMs' multimodal interaction ability remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by the textual modality when processing visual information; (3) they mostly possess only a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect these observations to serve as a reference for developing LMMs with stronger multimodal ability in the future.
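Abstract (reference translation): In different multimodal scenarios, information across modalities must be integrated and used in a specific way based on the demands of the task. The different ways of integrating modalities are called "multimodal interaction". How well a model handles various multimodal interactions characterizes its multimodal ability. This paper introduces MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). MIBench evaluates models on three key aspects: the ability to source information from vision-centric or text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels (Recognition, Understanding, and Reasoning). MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of current LMMs shows that (1) their multimodal interaction ability remains constrained despite the scaling of model parameters and training data, (2) they are easily distracted by the textual modality when processing visual information, (3) they mostly possess a basic capacity for multimodal synergy, and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. These observations are expected to serve as a reference for developing LMMs with improved multimodal ability in the future.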
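
The abstract describes each instance as a (con_v, con_t, task) triplet, scored along three interaction aspects and three cognitive levels. The following is a minimal Python sketch of one way such a record and a per-cell accuracy breakdown could be represented. It is an illustration only: the class names, field names, the `answer` field, and the exact-match scoring are assumptions made for this example, not MIBench's actual data schema or evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Interaction(Enum):
    """The three interaction aspects named in the abstract."""
    VISION_CENTRIC = "vision-centric"
    TEXT_CENTRIC = "text-centric"
    SYNERGY = "synergy"


class Level(Enum):
    """The three cognitive levels named in the abstract."""
    RECOGNITION = "recognition"
    UNDERSTANDING = "understanding"
    REASONING = "reasoning"


@dataclass(frozen=True)
class Instance:
    # Hypothetical field layout; only the (con_v, con_t, task) triplet
    # comes from the abstract.
    con_v: str                 # visual context, e.g. an image path (assumed)
    con_t: str                 # textual context
    task: str                  # task prompt or task name
    interaction: Interaction   # assumed label: which aspect is probed
    level: Level               # assumed label: which cognitive level
    answer: str                # assumed gold answer for exact-match scoring


def accuracy_by_cell(instances, predictions):
    """Return exact-match accuracy per (interaction, level) cell."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        cell = (inst.interaction, inst.level)
        totals[cell] += 1
        correct[cell] += int(pred == inst.answer)
    return {cell: correct[cell] / totals[cell] for cell in totals}
```

Grouping scores by (interaction, level) mirrors the hierarchical evaluation the abstract describes, so that a weakness at one level, such as reasoning under synergy, stays visible instead of being averaged away.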

Related papers