Fugu-MT Paper Translation (Abstract): What do vision-language models see in the context? Investigating multimodal in-context learning

Paper summary: What do vision-language models see in the context? Investigating multimodal in-context learning

arxiv url: http://arxiv.org/abs/2510.24331v1
Date: Tue, 28 Oct 2025 11:55:24 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:37.101366
Title: What do vision-language models see in the context? Investigating multimodal in-context learning
Title (reference translation): Vision-Language Models and Context: An Examination of Multimodal In-Context Learning
Authors: Gabriel O. dos Santos, Esther Colombini, Sandra Avila
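Abstract summary: In-context learning (ICL) enables large language models to learn tasks from demonstration examples without parameter updates. This paper presents a systematic study of ICL in vision-language models (VLMs). The authors analyze how prompt design, architectural choices, and training strategies influence multimodal ICL.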
Reference score (site-computed interest level): 2.1119217917006234
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
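Abstract (reference translation): In-context learning (ICL) allows large language models (LLMs) to learn tasks from demonstration examples without parameter updates. Although ICL has been widely studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. This paper evaluates seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies affect multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs change as the number of in-context demonstrations increases. The results show that training on image-text interleaved data improves ICL performance but does not imply effective integration of visual and textual information from the demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses show that current VLMs focus mainly on textual cues and fail to leverage visual information, suggesting limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and offer insights for improving their ability to learn from multimodal in-context examples.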
