Fugu-MT Paper Translation (Abstract): Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

Paper summary: Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning

arxiv url: http://arxiv.org/abs/2602.04872v1
Date: Wed, 04 Feb 2026 18:57:30 GMT
Status: Translation complete
Last updated in system: 2026-02-05 19:45:11.698792
Title: Multi-layer Cross-Attention is Provably Optimal for Multi-modal In-context Learning
Title (reference translation): Optimality of Multi-layer Cross-Attention in Multi-modal In-context Learning
Authors: Nicholas Barnfield, Subhabrata Sen, Pragya Sur
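Abstract summary: This paper introduces a mathematically tractable framework for studying multi-modal learning and examines when transformer-like architectures can recover Bayes-optimal performance in-context. The results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
Reference score (site-computed attention metric): 7.67220299822976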
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.
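Abstract (reference translation): Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical foundations of in-context learning for multi-modal data are still poorly understood. This paper introduces a mathematically tractable framework for studying multi-modal learning and examines when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, the observed data is assumed to arise from a latent factor model. The first result is a negative one on expressibility: it is proved that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, the authors introduce a novel, linearized cross-attention mechanism, studied in the regime where both the number of cross-attention layers and the context length are large. This cross-attention mechanism is shown to be provably Bayes optimal when optimized using gradient flow. The results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.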

Related papers