Fugu-MT Paper Translation (Abstract): Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat

Paper overview: Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat

arxiv url: http://arxiv.org/abs/2603.02227v1
Date: Wed, 11 Feb 2026 15:06:44 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.091856
Title: Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat
Authors: Keston Aquino-Michaels
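Abstract summary: When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed. Differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random. In Mixture-of-Experts, experts co-adapt to any router, but the paper shows that attention exhibits a structurally more severe form of the same effect.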
Reference score (attention level computed independently by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.
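Point (2) of the abstract, that hard top-k gating receives exactly zero gradient through the mask, can be illustrated with a toy example. The sketch below is not the paper's code: it assumes a single query with four keys, scalar values, a sigmoid as the soft gate, and central finite differences in place of automatic differentiation. All function names and numbers are hypothetical. A hard top-k mask is piecewise constant in the gate scores, so a small perturbation leaves the output unchanged, while a sigmoid gate passes a nonzero signal.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def hard_topk_mask(gate_scores, k):
    """1.0 for the k highest gate scores, 0.0 elsewhere (piecewise constant)."""
    order = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(gate_scores))]

def soft_mask(gate_scores):
    """Sigmoid gate (an assumed choice): smooth in the gate scores."""
    return [1.0 / (1.0 + math.exp(-g)) for g in gate_scores]

def gated_attention_output(logits, values, mask):
    """Scalar attention output for one query: masked, renormalised softmax."""
    weights = [w * m for w, m in zip(softmax(logits), mask)]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def finite_diff_grad(mask_fn, gate_scores, logits, values, eps=1e-4):
    """Central-difference gradient of the output w.r.t. each gate score."""
    grads = []
    for i in range(len(gate_scores)):
        up = list(gate_scores)
        up[i] += eps
        down = list(gate_scores)
        down[i] -= eps
        f_up = gated_attention_output(logits, values, mask_fn(up))
        f_down = gated_attention_output(logits, values, mask_fn(down))
        grads.append((f_up - f_down) / (2 * eps))
    return grads

# Hypothetical toy inputs: one query attending over four keys.
logits = [2.0, 0.5, -1.0, 1.5]
values = [1.0, -2.0, 0.5, 3.0]
gate_scores = [0.9, -0.3, 0.1, 0.6]

hard_grads = finite_diff_grad(lambda g: hard_topk_mask(g, k=2), gate_scores, logits, values)
soft_grads = finite_diff_grad(soft_mask, gate_scores, logits, values)

print("hard top-k gate gradients:", hard_grads)
print("soft sigmoid gate gradients:", soft_grads)
```

The hard-gate gradients vanish because the perturbation never changes which two entries are selected. The sketch shows only that the hard mask blocks the gate's learning signal; it does not reproduce the co-adaptation of Q/K/V that the abstract reports for soft gates.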

Related papers