Fugu-MT Paper Translation (Abstract): Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

Paper overview: Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition

arxiv url: http://arxiv.org/abs/2605.10466v1
Date: Mon, 11 May 2026 12:33:15 GMT
Status: Translation complete
Last updated in system: 2026-05-13 02:24:05.563376
Title: Self-Attention as a Covariance Readout: A Unified View of In-Context Learning and Repetition
Authors: Haoren Xu, Guanhua Fang
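Abstract summary: Large language models (LLMs) exhibit two behaviours, in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. The authors ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative.
Reference score (independently computed attention score): 8.250374560598495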
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) exhibit two striking and ostensibly unrelated behaviours: in-context learning (ICL) and repetitive generation. In both, the model behaves as though it had summarised the context into a population-level statistic and discarded token-level detail. We ask whether this "summarisation and forgetting" can be derived from the attention mechanism itself, and answer in the affirmative. Under stationary, ergodic and elliptical inputs, the softmax attention output converges almost surely to $\Theta_V \Sigma \Theta_K^{\top} \Theta_Q x_t$, where $\Sigma$ is the input covariance; the long-context limit is therefore a linear readout of the input's second-order statistics. Two consequences follow. (i) For in-context linear regression, a single softmax head can implement one step of population gradient descent. Stacking such heads with residual connections iterates this update and implements multiple gradient descent steps. (ii) Propagated across an $L$-layer transformer, this readout drives the terminal hidden state at the parametric $1/t$ rate to a deterministic function of the current token alone, so that autoregressive generation collapses asymptotically to a first-order Markov chain whose attracting orbits furnish a structural account of repetition and mode collapse. The two phenomena thus emerge as facets of a single covariance-readout principle.
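
The convergence claim in the abstract can be checked numerically in a simple special case. The sketch below is not the authors' code. It assumes i.i.d. zero-mean Gaussian context tokens (one instance of stationary, ergodic, elliptical inputs), a single head with no 1/sqrt(d) score scaling, and a query held fixed while the context grows. The matrices, the query vector and the function names are all hypothetical, chosen only for illustration.

```python
import numpy as np


def softmax_attention(x_query, x_ctx, theta_q, theta_k, theta_v):
    """Single-head softmax attention of one query over n context tokens."""
    q = theta_q @ x_query                 # query vector, shape (d,)
    keys = x_ctx @ theta_k.T              # one key per context token, shape (n, d)
    values = x_ctx @ theta_v.T            # one value per context token, shape (n, d)
    scores = keys @ q                     # unscaled dot-product scores, shape (n,)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values


def covariance_readout(x_query, sigma, theta_q, theta_k, theta_v):
    """Long-context limit stated in the abstract: Theta_V Sigma Theta_K^T Theta_Q x_t."""
    return theta_v @ sigma @ theta_k.T @ theta_q @ x_query


# Hypothetical small example (d = 3); every number below is an assumption.
sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])      # input covariance (positive definite)
theta_q = 0.5 * np.eye(3)
theta_k = 0.5 * np.array([[1.0, 0.2, 0.0],
                          [0.0, 1.0, 0.2],
                          [0.2, 0.0, 1.0]])
theta_v = np.array([[1.0, 0.5, 0.0],
                    [0.0, 1.0, -0.5],
                    [0.5, 0.0, 1.0]])
x_query = np.array([1.0, -0.5, 0.5])

rng = np.random.default_rng(0)
x_ctx = rng.multivariate_normal(np.zeros(3), sigma, size=200_000)

attn = softmax_attention(x_query, x_ctx, theta_q, theta_k, theta_v)
limit = covariance_readout(x_query, sigma, theta_q, theta_k, theta_v)

print("attention output :", attn)
print("covariance limit :", limit)
print("l2 difference    :", np.linalg.norm(attn - limit))
```

For Gaussian inputs the limit follows from exponential tilting: reweighting N(0, Σ) by exp(aᵀx) yields N(Σa, Σ), so the softmax-weighted mean of the values tends to Θ_V Σ a with a = Θ_Kᵀ Θ_Q x_t. The sketch does not cover the paper's general elliptical setting, the convergence rate, or the multi-layer results.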

Related papers