Fugu-MT Paper Translation (Abstract): Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Paper abstract: Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

arxiv url: http://arxiv.org/abs/2605.08475v1
Date: Fri, 08 May 2026 20:50:58 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.67373
Title: Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression
Authors: Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang
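Abstract summary: We study in-context kernel ridge regression (KRR) with Gaussian kernels. We show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass. Its error profiles are found to align most consistently with preconditioned Richardson iteration.
Reference score (independently computed attention score): 13.818160005611752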
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/ε))$ blocks and MLP width $O(\sqrt{N/ε})$ that achieves $ε$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
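
The solver named in the abstract can be illustrated outside any transformer. The following is a minimal NumPy sketch, not the paper's construction: it runs preconditioned Richardson iteration on the kernel linear system $(K + \lambda I)\alpha = y$ and compares the result with a direct solve. The diagonal row-sum preconditioner is an assumption, chosen because it yields a row-normalized Gaussian-kernel operator similar in spirit to what softmax attention computes; the abstract does not specify the preconditioner. The function names, bandwidth, ridge value, step size `omega`, and synthetic data are all hypothetical.

```python
import numpy as np


def gaussian_kernel(X, Z, bandwidth=1.0):
    """Gaussian kernel matrix with entries exp(-||x - z||^2 / (2 * bandwidth^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def richardson_krr(K, y, ridge, steps, omega=1.0):
    """Preconditioned Richardson iteration for (K + ridge * I) alpha = y.

    Assumption: the preconditioner is the diagonal matrix of row sums of
    A = K + ridge * I, so each update applies a row-normalized operator.
    """
    A = K + ridge * np.eye(K.shape[0])
    row_sums = A.sum(axis=1)  # strictly positive for a Gaussian kernel
    alpha = np.zeros_like(y)
    for _ in range(steps):
        residual = y - A @ alpha
        alpha = alpha + omega * residual / row_sums
    return alpha


# Synthetic in-context prompt: N = 16 labelled points and one query point.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(16, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(16)
x_query = rng.uniform(-1.0, 1.0, size=(1, 2))
ridge = 1.0

K = gaussian_kernel(X, X)
alpha_iter = richardson_krr(K, y, ridge, steps=500)
alpha_exact = np.linalg.solve(K + ridge * np.eye(16), y)

# KRR prediction at the query: k(x_query, X) @ alpha.
k_query = gaussian_kernel(x_query, X)
pred_iter = k_query @ alpha_iter
pred_exact = k_query @ alpha_exact
print(float(np.max(np.abs(pred_iter - pred_exact))))
```

With this preconditioner and `omega = 1.0`, the preconditioned matrix has nonnegative entries and rows that sum to one, and its eigenvalues lie in $(0, 1]$, so the error contracts at every step. This geometric contraction is the kind of behavior consistent with the $O(\log(1/ε))$ block count stated in the abstract, though the sketch makes no claim about matching the paper's constants.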

Related papers