Fugu-MT Paper Translation (Abstract): Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

Paper abstract: Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

arxiv url: http://arxiv.org/abs/2606.18283v1
Date: Tue, 09 Jun 2026 23:35:17 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:50.787245
Title: Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing
Authors: Yongchao Huang, Hassan Raza
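Abstract summary: GMA replaces explicit pairwise query-key comparison with routing through $K$ learned Gaussian mixture components. The authors formulate bidirectional and causal variants of GMA and provide an end-to-end differentiable parameterization of the Gaussian mixture components. They analyze its responsibility-modulated gradient structure, its constrained non-negative low-rank affinity interpretation, and its local routing stability.
Reference score (independently computed attention measure): 0.0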
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce \textbf{Gaussian Mixture Attention (GMA)}, a probabilistic attention-style sequence mixer that replaces explicit pairwise query--key comparison with routing through $K$ learned Gaussian mixture components. Queries and keys are mapped to posterior \textit{responsibility} vectors over a shared latent routing space; their overlap defines an implicit responsibility-space affinity, while values are written into and read from a $K$-slot latent memory. By exploiting the associativity of matrix multiplication, GMA avoids materializing the induced $N\times N$ affinity matrix and instead uses two responsibility matrices whose dominant activation storage scales as $\mathcal{O}(NK)$ rather than $\mathcal{O}(N^2)$ for fixed $K$. We formulate bidirectional and causal variants of GMA, provide an end-to-end differentiable parameterization of the Gaussian mixture components, and analyze its responsibility-modulated gradient structure, constrained non-negative low-rank affinity interpretation, and local routing stability. Empirically, GMA exhibits the intended fixed-$K$ linear memory scaling and is competitive with attention-style baselines on long-context classification, while causal GMA improves over tested linear/random-feature attention variants on WikiText-103 but remains behind optimized causal SDPA and Mamba in the current implementation. Analysis of learned responsibilities further shows broad component usage and moderate alignment with surface-form token categories, supporting GMA as a probabilistic, interpretable, fixed-$K$ linear-time attention-style alternative rather than a universal replacement for optimized softmax attention or state-space models.
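The write/read routing described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes isotropic Gaussian components shared by queries and keys, a single head, and a simple row normalization of the implicit affinity; the abstract specifies none of these, and all function and parameter names here (`responsibilities`, `gma_bidirectional`, `gma_causal`, `log_var`, `log_pi`) are hypothetical. The sketch only shows how associativity lets the $N\times N$ affinity stay implicit while the stored responsibilities scale as $\mathcal{O}(NK)$.

```python
import numpy as np

def responsibilities(x, means, log_var, log_pi):
    """Posterior p(component | token) under isotropic Gaussians. Returns (N, K)."""
    # Squared distance between every token and every component mean.
    d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)        # (N, K)
    dim = x.shape[-1]
    # Log joint up to a constant shared by all components.
    log_p = log_pi - 0.5 * (d2 / np.exp(log_var) + dim * log_var)  # (N, K)
    log_p = log_p - log_p.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(log_p)
    return p / p.sum(axis=-1, keepdims=True)

def gma_bidirectional(q, k, v, means, log_var, log_pi, eps=1e-9):
    """Assumed form: out = R_q (R_k^T V), row-normalised; no N x N matrix is built."""
    r_q = responsibilities(q, means, log_var, log_pi)  # (N, K)
    r_k = responsibilities(k, means, log_var, log_pi)  # (N, K)
    memory = r_k.T @ v          # (K, d_v): write values into K latent slots
    mass = r_k.sum(axis=0)      # (K,): total key responsibility per slot
    out = r_q @ memory          # (N, d_v): read from the slots
    norm = r_q @ mass           # (N,): row sums of the implicit affinity
    return out / (norm[:, None] + eps)

def gma_causal(q, k, v, means, log_var, log_pi, eps=1e-9):
    """Causal variant (assumed): token t reads a memory built from keys 0..t only."""
    r_q = responsibilities(q, means, log_var, log_pi)
    r_k = responsibilities(k, means, log_var, log_pi)
    memory = np.zeros((means.shape[0], v.shape[-1]))
    mass = np.zeros(means.shape[0])
    out = np.empty_like(v, dtype=float)
    for t in range(q.shape[0]):
        memory += np.outer(r_k[t], v[t])   # running K-slot write
        mass += r_k[t]
        out[t] = (r_q[t] @ memory) / (r_q[t] @ mass + eps)
    return out

def gma_dense_reference(q, k, v, means, log_var, log_pi, causal=False, eps=1e-9):
    """Explicit N x N affinity, used only to check the factored versions."""
    r_q = responsibilities(q, means, log_var, log_pi)
    r_k = responsibilities(k, means, log_var, log_pi)
    affinity = r_q @ r_k.T                 # (N, N), non-negative, rank <= K
    if causal:
        affinity = np.tril(affinity)
    return (affinity @ v) / (affinity.sum(axis=-1, keepdims=True) + eps)
```

Under these assumptions, `gma_bidirectional` and `gma_dense_reference` agree up to floating-point error because $(R_q R_k^\top)V = R_q(R_k^\top V)$. The factored form stores only the two $N\times K$ responsibility matrices and a $K\times d_v$ memory.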


Related papers