Fugu-MT Paper Translation (Abstract): A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training

Paper overview: A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training

arxiv url: http://arxiv.org/abs/2606.10469v1
Date: Tue, 09 Jun 2026 06:38:27 GMT
Status: Translation complete
System update date: 2026-06-10 15:40:58.352853
Title: A Mean-Field Analysis of Multi-Head Self-Attention under Cross-Entropy Training
Authors: Cheng Huan, Hongfwei Yuan
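Abstract summary: This paper presents a mean-field theory for a single-layer causal multi-head self-attention model trained by cross-entropy minimization. In the infinite-head limit, the averaged attention logits define a risk functional on probability measures, whose first variation generates a nonlinear Wasserstein gradient-flow equation. We study the long-time behavior of the PDE: energy dissipation, convergence to the stationary set under compactness, convergence to a single stationary measure under topological or Kurdyka--Łojasiewicz assumptions, and explicit convergence rates under gradient-domination conditions.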
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper develops a mean-field theory for a simplified single-layer causal multi-head self-attention model trained by cross-entropy minimization. Each attention head is treated as a particle in parameter space, and the empirical law of the heads is used as the large-head state variable. In the infinite-head limit, the averaged attention logits define a risk functional on probability measures, whose first variation generates a nonlinear Wasserstein gradient-flow equation. Unlike classical mean-field analyses of shallow networks that often focus on square-loss regression, the present model contains the softmax residual from the cross-entropy objective and the query-key-value structure of masked self-attention. We prove a static finite-head approximation bound for the optimal risk, characterize global minimizers through a variational support condition, and establish a quantitative finite-time propagation-of-chaos estimate comparing finite-head stochastic gradient descent with the limiting PDE. We then study the long-time behavior of the PDE: energy dissipation, convergence to the stationary set under compactness, convergence to a single stationary measure under topological or Kurdyka--Łojasiewicz assumptions, and explicit convergence rates under gradient-domination conditions. Finally, we prove local exponential stability under a Wasserstein strong-monotonicity condition and give verifiable stability and instability criteria for Dirac stationary measures. The results provide a rigorous baseline mean-field framework for attention-head training and clarify the additional compactness, landscape, and curvature assumptions needed to pass from stationarity to convergence and stability.
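The objects named in the abstract can be written in the generic form common to mean-field analyses of shallow networks. The following is a sketch under assumptions, not the paper's own statement: the symbols $\Theta$ (head parameter space), $h(\theta; x)$ (logit contribution of one head with parameters $\theta$ on input $x$), $f_\rho$, $R$, and $e_y$ (one-hot label vector) are illustrative notation, and the paper's exact model, time scaling, and regularity conditions may differ.

```latex
% Illustrative notation only (assumed, not taken from the paper).

% Empirical law of H heads and the averaged logits it induces:
\[
\rho^H = \frac{1}{H}\sum_{j=1}^{H} \delta_{\theta_j},
\qquad
f_\rho(x) = \int_\Theta h(\theta; x)\, d\rho(\theta).
\]

% Cross-entropy risk as a functional on probability measures:
\[
R(\rho) = \mathbb{E}_{(x,y)}\bigl[ -\log \operatorname{softmax}\bigl(f_\rho(x)\bigr)_y \bigr].
\]

% First variation: the softmax residual paired with the single-head output.
\[
\frac{\delta R}{\delta \rho}(\rho)(\theta)
= \mathbb{E}_{(x,y)}\bigl[ \bigl\langle \operatorname{softmax}\bigl(f_\rho(x)\bigr) - e_y,\; h(\theta; x) \bigr\rangle \bigr].
\]

% Wasserstein gradient flow of R (continuity equation with velocity
% -\nabla_\theta \delta R / \delta \rho):
\[
\partial_t \rho_t
= \nabla_\theta \cdot \Bigl( \rho_t\, \nabla_\theta \frac{\delta R}{\delta \rho}(\rho_t) \Bigr).
\]

% Formal energy dissipation along sufficiently regular solutions:
\[
\frac{d}{dt} R(\rho_t)
= -\int_\Theta \Bigl\| \nabla_\theta \frac{\delta R}{\delta \rho}(\rho_t)(\theta) \Bigr\|^2 d\rho_t(\theta) \;\le\; 0.
\]
```

In this notation the factor $\operatorname{softmax}(f_\rho(x)) - e_y$ is the gradient of the cross-entropy loss with respect to the logits, which is the softmax residual mentioned in the abstract; under a square loss it would be replaced by the plain prediction error.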

List of related papers