Fugu-MT Paper Translation (Abstract): Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

Paper overview: Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity

arxiv url: http://arxiv.org/abs/2605.20271v1
Date: Mon, 18 May 2026 23:43:53 GMT
Status: Translation complete
System update date: 2026-05-21 19:19:56.255941
Title: Multi-Head Attention as Ensemble Nadaraya-Watson Estimation: Variance Reduction, Decorrelation, and Optimal Head Diversity
Authors: Ernest Fokoué
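Abstract summary: We develop a rigorous theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. We show that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA mean squared error is monotonically decreasing in HDI.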
Reference score (attention level computed independently by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we prove that MHA is a structured ensemble of H NW estimators, each operating in a distinct learned projection subspace of the key space. We derive an explicit Bias-Variance-Covariance decomposition of the MHA mean squared error, showing that variance reduction depends not merely on the number of heads H but fundamentally on the decorrelation of head outputs. Decorrelation is governed by the principal angles between learned projection subspaces: orthogonal projections yield maximum variance reduction; aligned projections yield none. We introduce the Head Diversity Index (HDI), a computable spectral measure of inter-head decorrelation, and prove that MHA mean squared error is monotonically decreasing in HDI. This provides the first rigorous theoretical explanation for the empirically observed specialization of attention heads. Under a fixed total-dimension budget D = H * d_k, we solve the optimal head-dimension allocation problem, deriving the MSE-minimizing pair (H*, d_k*) from data distribution and regression smoothness. The solution yields a new architectural scaling law: the optimal per-head dimension grows logarithmically with training set size, while the optimal number of heads grows nearly linearly with the total budget D. Our framework unifies three strands of prior work: the NW theory of single-head attention, the general weighting theory for ensemble learning, and the decorrelation-variance-reduction isomorphism between biological and computational ensembles. Multi-head attention is the Transformer's instantiation of a universal principle: identical agents plus diversity-enforcing mechanisms yields emergent optimality.
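The two ingredients the abstract builds on can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the exponential kernel exp(q·k / sqrt(d_k)) for the NW estimator, and it uses the textbook variance formula for the mean of H equally correlated estimators as a simplified stand-in for the paper's Bias-Variance-Covariance decomposition. All function names, array sizes, and the values of `sigma2` and `rho` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def softmax_attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    d_k = K.shape[1]
    return softmax(K @ q / np.sqrt(d_k)) @ V


def nadaraya_watson(q, K, V):
    """NW estimate sum_i k(q, k_i) v_i / sum_i k(q, k_i).

    Assumption: the kernel is k(q, k_i) = exp(q . k_i / sqrt(d_k)).
    """
    d_k = K.shape[1]
    w = np.exp(K @ q / np.sqrt(d_k))
    return (w[:, None] * V).sum(axis=0) / w.sum()


def ensemble_variance(H, sigma2, rho):
    """Variance of the mean of H estimators that share variance sigma2
    and pairwise correlation rho (standard equicorrelated identity)."""
    return sigma2 / H + (1.0 - 1.0 / H) * rho * sigma2


rng = np.random.default_rng(0)
n, d_k, d_v = 50, 8, 4                 # hypothetical sizes
K = rng.normal(size=(n, d_k))          # keys
V = rng.normal(size=(n, d_v))          # values
q = rng.normal(size=d_k)               # one query

# Softmax attention and the NW estimate coincide for this kernel.
identity_holds = np.allclose(softmax_attention(q, K, V),
                             nadaraya_watson(q, K, V))

# Fully decorrelated heads give the 1/H reduction; perfectly
# correlated heads give no reduction at all.
var_uncorrelated = ensemble_variance(H=8, sigma2=1.0, rho=0.0)
var_aligned = ensemble_variance(H=8, sigma2=1.0, rho=1.0)
```

In this simplified model the variance falls from `sigma2` to `sigma2 / H` only when `rho` is zero, which mirrors the abstract's claim that the number of heads alone does not determine variance reduction. How `rho` relates to the principal angles between projection subspaces, and how the HDI is defined, are specified in the paper and are not reproduced here.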

Related papers