Fugu-MT Paper Translation (Abstract): Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

Paper overview: Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

arxiv url: http://arxiv.org/abs/2604.13403v1
Date: Wed, 15 Apr 2026 02:12:51 GMT
Status: Translation complete
System update date: 2026-04-16 20:38:32.351594
Title: Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks
Title (reference translation): Why Does Multimodal In-Context Learning Lag Behind?
Authors: Yu Wang, Sharon Li
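Abstract summary: In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood. Multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations.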
Reference score (independently computed attention score): 7.62772056485722
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.
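Abstract (reference translation): In-context learning (ICL) enables models to adapt to new tasks through inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings is still poorly understood, both in its internal mechanisms and in how it differs from text-only ICL. This work presents a systematic analysis of ICL in multimodal large language models. Multimodal ICL performs on par with text-only ICL in zero-shot settings but degrades markedly under few-shot demonstrations. To understand this gap, the authors decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings and transfer them to query samples across layers. The analysis shows that current models lack reasoning-level alignment between visual and textual representations and cannot reliably transfer learned task mappings to queries. Guided by these findings, the authors propose a simple inference-stage enhancement method that reinforces task mapping transfer. The work offers new insights into the mechanisms and limitations of multimodal ICL and suggests directions for more effective multimodal adaptation. The code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.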

Related papers