Fugu-MT Paper Translation (Abstract): Projection-Free Transformers via Gaussian Kernel Attention

Paper summary: Projection-Free Transformers via Gaussian Kernel Attention

arxiv url: http://arxiv.org/abs/2605.02144v1
Date: Mon, 04 May 2026 01:57:59 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:50.104115
Title: Projection-Free Transformers via Gaussian Kernel Attention
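Title (reference translation): Projection-Free Transformers via Gaussian Kernel Attention
Authors: Debarshi Kundu, Archisman Ghosh, Swaroop Ghosh, Vasant Honavar
Abstract summary: Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We introduce Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention.
Reference score (independently computed attention score): 0.4899818550820574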
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Self-attention in Transformers is typically implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. We ask whether these learned projections are necessary, or whether they can be replaced by a simpler similarity-based diffusion operator. We introduce Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly using a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, while a single output projection $W_O$ preserves compatibility with the standard Transformer interface. GKA can be interpreted as normalized kernel regression over tokens, linking modern Transformer architectures to classical non-local filtering and kernel smoothing methods. We evaluate GKA in both vision and language modeling settings. For autoregressive language modeling within the nanochat framework, we implement causal masking and sliding-window constraints by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, exhibits a near-zero train-validation gap, and demonstrates competitive behavior on standard benchmarks, albeit with higher bits-per-byte (BPB) at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.
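Abstract (reference translation): Self-attention in Transformers is generally implemented as $\mathrm{softmax}(QK^\top/\sqrt{d})V$, where $Q=XW_Q$, $K=XW_K$, and $V=XW_V$ are learned linear projections of the input $X$. The paper asks whether these learned projections are needed, or whether a simpler similarity-based diffusion operator can take their place. It introduces Gaussian Kernel Attention (GKA), a drop-in replacement for dot-product attention that computes token affinities directly with a Gaussian radial basis function (RBF) kernel applied to per-head token features. Each head learns only a bandwidth parameter $\sigma_h$, and a single output projection $W_O$ keeps the layer compatible with the standard Transformer interface. GKA can be read as normalized kernel regression over tokens, which links modern Transformer architectures to classical non-local filtering and kernel smoothing methods. GKA is evaluated on both vision and language modeling. For autoregressive language modeling within the nanochat framework, causal masking and sliding-window constraints are implemented by masking and renormalizing the Gaussian kernel. At depth 20, a GKA model with $0.42\times$ the parameters and $0.49\times$ the total training FLOPs of a standard attention baseline trains stably, shows a near-zero train-validation gap, and behaves competitively on standard benchmarks, although its bits-per-byte (BPB) is higher at this compute scale. Overall, GKA provides a minimal, interpretable attention mechanism with an explicit locality scale, offering a dimension in the accuracy-efficiency trade-off for Transformer design.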
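
The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that per-head token features are obtained by splitting the input channels evenly across heads, that the same features serve as the values being averaged, that the kernel has the form $\exp(-\|x_i - x_j\|^2 / (2\sigma_h^2))$, and that the bandwidth is stored as a log value to keep it positive. The names `gaussian_kernel_attention`, `log_sigma`, `w_o`, and `window` are hypothetical.

```python
import numpy as np


def gaussian_kernel_attention(x, log_sigma, w_o, causal=True, window=None):
    """Illustrative Gaussian kernel attention for a single sequence.

    x:         (T, D) token features; D must be divisible by the head count H
    log_sigma: (H,) per-head log-bandwidths (assumed parameterization)
    w_o:       (D, D) output projection
    Returns the (T, D) output and the (H, T, T) normalized weights.
    """
    T, D = x.shape
    H = log_sigma.shape[0]
    d = D // H
    # Split channels into heads (assumption): (H, T, d)
    heads = x.reshape(T, H, d).transpose(1, 0, 2)

    # Pairwise squared Euclidean distances per head: (H, T, T)
    sq_norm = (heads ** 2).sum(axis=-1)
    dist2 = (sq_norm[:, :, None] + sq_norm[:, None, :]
             - 2.0 * heads @ heads.transpose(0, 2, 1))
    dist2 = np.maximum(dist2, 0.0)  # guard against tiny negative round-off

    # Gaussian RBF affinities with one bandwidth per head
    sigma = np.exp(log_sigma)[:, None, None]
    kernel = np.exp(-dist2 / (2.0 * sigma ** 2))

    # Causal and sliding-window constraints: zero out disallowed pairs
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    mask = np.ones((T, T), dtype=bool)
    if causal:
        mask &= j <= i
    if window is not None:
        mask &= (i - j) < window
    kernel = np.where(mask, kernel, 0.0)

    # Renormalize each row; the diagonal stays unmasked (for window >= 1),
    # so every row sum is positive
    weights = kernel / kernel.sum(axis=-1, keepdims=True)

    # Normalized kernel regression over tokens, then the output projection
    out = (weights @ heads).transpose(1, 0, 2).reshape(T, D)
    return out @ w_o, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))
out, weights = gaussian_kernel_attention(
    x, log_sigma=np.zeros(2), w_o=np.eye(8), causal=True, window=3
)
print(out.shape, weights.shape)
```

In this sketch the only attention-specific parameters are the `H` bandwidths and `w_o`, which mirrors the abstract's description of a layer with no query, key, or value projections. A small `sigma` concentrates each row's weight on near-identical tokens, while a large `sigma` approaches a uniform average over the allowed window.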

Related papers