Fugu-MT Paper Translation (Abstract): Stochastic Sparse Attention for Memory-Bound Inference

Paper abstract: Stochastic Sparse Attention for Memory-Bound Inference

arxiv url: http://arxiv.org/abs/2605.01910v1
Date: Sun, 03 May 2026 14:44:14 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.989964
Title: Stochastic Sparse Attention for Memory-Bound Inference
Title (reference translation): Stochastic Sparse Attention for Memory-Bound Inference
Authors: Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari
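Abstract summary: Stochastic Additive No-mulT Attention (SANTA) sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution. The authors also propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage.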
Reference score (independently computed attention score): 19.301894658575502
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregates only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
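The value-stage idea in the abstract can be illustrated with a minimal NumPy sketch for a single query and a single head. This is not the authors' kernel: the function names, the $1/\sqrt{d}$ score scaling, and the specific stratification scheme (inverse-CDF sampling with one uniform draw per stratum) are assumptions chosen for illustration. Both estimators gather $S$ value rows, add them, and apply one final scale, and both have expectation equal to the exact post-softmax aggregation.

```python
import numpy as np

def attention_probs(q, K):
    """Post-softmax attention distribution for one query (assumed 1/sqrt(d) scaling)."""
    scores = K @ q / np.sqrt(q.shape[0])
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

def exact_attention(q, K, V):
    """Dense baseline: reads every value row and multiplies by its weight."""
    return attention_probs(q, K) @ V

def sampled_attention(q, K, V, S, rng):
    """Draw S i.i.d. indices from the post-softmax distribution and average
    the gathered value rows. E[V[idx]] = sum_i p_i V_i, so this is unbiased."""
    p = attention_probs(q, K)
    idx = rng.choice(len(p), size=S, replace=True, p=p)
    return V[idx].sum(axis=0) / S  # gather-and-add, then one scale

def stratified_attention(q, K, V, S, rng):
    """Illustrative stratified variant (an assumption, not the paper's kernel):
    split [0, 1) into S equal strata, draw one uniform per stratum, and map
    each draw to an index through the CDF of p. Still unbiased."""
    p = attention_probs(q, K)
    cdf = np.cumsum(p)
    u = (np.arange(S) + rng.random(S)) / S
    idx = np.searchsorted(cdf, u, side="right")
    idx = np.minimum(idx, len(p) - 1)  # guard against cdf[-1] rounding below 1
    return V[idx].sum(axis=0) / S
```

With proportional strata like these, the variance of the stratified estimator is never larger than that of the i.i.d. estimator at the same $S$, which is the usual motivation for stratification. Whether the paper's GPU-friendly variants use this exact scheme is not stated in the abstract.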
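Abstract (reference translation): Autoregressive decoding becomes bandwidth-limited at long contexts because generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), which sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. This yields an unbiased estimate of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, achieving a $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique that sparsifies the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at https://github.com/OPUSLab/SANTA.git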
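The abstract does not spell out how the stochastic ternary queries are built. One plausible reading, sketched below purely as an assumption, is stochastic rounding of the query to $\{-1, 0, +1\}$ with a shared scale $\alpha = \max_j |q_j|$: feature $j$ is kept with probability $|q_j| / \alpha$ and carries the sign of $q_j$. Then $\mathbb{E}[\alpha t] = q$, so $\alpha\, K t$ is an unbiased estimate of $Kq$ that reads only the key features where $t_j \neq 0$ and needs only additions and subtractions before the final scale. The names `ternary_query` and `sparse_scores` are hypothetical.

```python
import numpy as np

def ternary_query(q, rng):
    """Stochastically round q to a ternary vector t in {-1, 0, +1} plus a scale alpha.
    Assumed scheme: keep feature j with probability |q_j| / alpha, alpha = max |q_j|."""
    alpha = float(np.max(np.abs(q)))
    if alpha == 0.0:
        return np.zeros(q.shape, dtype=np.int8), 0.0
    keep = rng.random(q.shape) < np.abs(q) / alpha  # Bernoulli mask per feature
    t = (np.sign(q) * keep).astype(np.int8)
    return t, alpha

def sparse_scores(t, alpha, K):
    """Score estimate alpha * (K @ t) that touches only the key columns with
    t_j != 0 and uses additions and subtractions before the single final scale."""
    pos = np.flatnonzero(t > 0)
    neg = np.flatnonzero(t < 0)
    return alpha * (K[:, pos].sum(axis=1) - K[:, neg].sum(axis=1))
```

Under this assumed scheme, features with small $|q_j|$ are skipped most of the time, so the number of key columns read per decode step shrinks with how concentrated the query's magnitude is.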

Related papers