Fugu-MT Paper Translation (Summary): Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

Paper summary: Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning

arxiv url: http://arxiv.org/abs/2509.23744v1
Date: Sun, 28 Sep 2025 08:46:11 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.416364
Title: Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Title (reference translation): Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Authors: Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan
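Abstract summary: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. However, reports conflict on whether added modalities help or harm performance. The authors categorize multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.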
Reference score (independently computed attention score): 49.17801010041155
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio. Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance. These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning. We address this gap through a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Moreover, reasoning degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities fail to be integrated effectively. Therefore, we identify two core failures: task-composition bottleneck, where recognition and reasoning cannot be jointly executed in one pass, and fusion bottleneck, where early integration introduces bias. For further investigation, we find that attention patterns fail to encode fact usefulness, but a simple two-step prompting (recognize then reason) restores performance, confirming the task-composition bottleneck. Moreover, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, highlighting biased fusion as another failure mode. Overall, our findings show that integration, not perception, is the main barrier to multimodal reasoning, suggesting composition-aware training and early fusion control as promising directions.
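The abstract reports that softening attention in early fusion improves reasoning, but it does not say how the softening is applied. One common way to soften an attention distribution is temperature scaling of the softmax. The sketch below illustrates only that general idea on made-up scores; the function name `softmax`, the `temperature` parameter, and the score values are assumptions for illustration, not the authors' method.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over raw attention scores, each divided by a temperature.

    A temperature above 1 flattens the distribution (softer attention);
    a temperature of 1 gives the standard softmax.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one query over four tokens (illustrative only).
scores = [4.0, 1.0, 0.5, 0.0]
sharp = softmax(scores, temperature=1.0)
soft = softmax(scores, temperature=2.0)
```

With the higher temperature, the largest weight shrinks and the remaining weights grow, so no single token (or modality) dominates as strongly.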
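Abstract (reference translation): Multimodal large language models (MLLMs) promise to strengthen reasoning by integrating diverse inputs such as text, vision, and audio. However, cross-modal reasoning is still underexplored, and reports conflict on whether added modalities help or hurt performance. These inconsistencies arise from the lack of controlled evaluation frameworks and of analyses of model internals that isolate when and why modality interactions support or undermine reasoning. The authors address this gap with a logic-grounded evaluation framework that categorizes multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined. Empirically, additional modalities enhance reasoning only when they provide independent and sufficient reasoning paths, while redundant or chained entailment support often hurts performance. Reasoning also degrades in three systematic ways: weaker modalities drag down overall performance, conflicts bias preference toward certain modalities, and joint signals from different modalities are not integrated effectively. The authors therefore identify two core failures: a task-composition bottleneck, where recognition and reasoning cannot be executed jointly in one pass, and a fusion bottleneck, where early integration introduces bias. On further investigation, attention patterns fail to encode fact usefulness, but a simple two-step prompt (recognize, then reason) restores performance, confirming the task-composition bottleneck. In addition, modality identity remains recoverable in early layers, and softening attention in early fusion improves reasoning, which points to biased fusion as a second failure mode. Overall, the findings show that integration, not perception, is the main barrier to multimodal reasoning, and suggest composition-aware training and early fusion control as promising directions.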
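The two-step prompting mentioned in the abstract (recognize, then reason) can be sketched as two separate rounds of model calls. This is a minimal illustration under stated assumptions: `two_step`, the prompt wording, the string placeholders standing in for image and audio inputs, and the `stub_model` callable are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

def two_step(model: Callable[[str], str],
             inputs: Dict[str, str],
             question: str) -> str:
    """Recognize facts per modality first, then reason over them."""
    # Step 1 (recognition): one call per modality, asking only for facts.
    facts: List[str] = []
    for modality, content in inputs.items():
        prompt = f"List the facts stated in this {modality} input:\n{content}"
        facts.append(model(prompt))

    # Step 2 (reasoning): a separate call that sees only the recognized facts.
    reasoning_prompt = (
        "Facts:\n" + "\n".join(facts)
        + f"\nQuestion: {question}\nAnswer:"
    )
    return model(reasoning_prompt)

# Stub standing in for a real MLLM call; it records every prompt it receives.
calls: List[str] = []

def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"reply-{len(calls)}"

answer = two_step(
    stub_model,
    {"image": "<image placeholder>", "audio": "<audio placeholder>"},
    "Is the claim entailed?",
)
```

Splitting the work this way keeps recognition and reasoning in separate passes, which is the property the abstract credits with restoring performance. A real setup would pass actual image and audio inputs to a multimodal model rather than strings.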

Related papers