Fugu-MT Paper Translation (Abstract): EntmaxKV: Support-Aware Decoding for Entmax Attention

Paper overview: EntmaxKV: Support-Aware Decoding for Entmax Attention

arxiv url: http://arxiv.org/abs/2605.21649v1
Date: Wed, 20 May 2026 19:03:38 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:41.965928
Title: EntmaxKV: Support-Aware Decoding for Entmax Attention
Title (reference translation): EntmaxKV: Support-Aware Decoding for Entmax Attention
Authors: Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso
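Abstract summary: EntmaxKV is an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. The analysis shows that output error is controlled by the dropped probability mass $δ$ and vanishes when the entmax support is recovered. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup at 1M context length.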
Reference score (site-computed attention metric): 5.759250057973468
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $α$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $δ$, showing that output error is controlled by $δ$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
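Abstract (reference translation): Long-context decoding is increasingly limited by KV-cache memory traffic, because each generated token attends over a cache that grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but they are designed for softmax attention. In contrast, $α$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. Recent entmax kernels enable efficient training, but they do not address the autoregressive decoding bottleneck, where dense inference streams the full KV cache before the sparsity is known. This work introduces EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. Truncation error is analyzed through the dropped probability mass $δ$, showing that output error is controlled by $δ$ and vanishes when the entmax support is recovered. The work further introduces a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics and adapts the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup at 1M context length. Code is available at https://github.com/deep-spin/entmaxkv.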
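The support-recovery claim in the abstract can be illustrated with a toy example. The sketch below is not the paper's implementation: it uses a plain bisection solver for the $α$-entmax threshold, scalar values instead of value vectors, and token-level rather than page-level selection. All names (`entmax`, `attend`, `dropped_mass`) and the example scores are hypothetical and chosen for illustration only.

```python
def entmax(scores, alpha=1.5, iters=100):
    """alpha-entmax (alpha > 1) via bisection on the threshold tau.

    p_i = max((alpha - 1) * z_i - tau, 0) ** (1 / (alpha - 1)),
    with tau chosen so that the probabilities sum to 1.
    """
    scaled = [(alpha - 1.0) * s for s in scores]
    exponent = 1.0 / (alpha - 1.0)

    def probs(tau):
        return [max(x - tau, 0.0) ** exponent for x in scaled]

    lo = max(scaled) - 1.0  # total mass is >= 1 at this threshold
    hi = max(scaled)        # total mass is 0 at this threshold
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) >= 1.0:
            lo = mid
        else:
            hi = mid
    p = probs(lo)
    total = sum(p)
    return [x / total for x in p]  # remove residual bisection error


def attend(scores, values, keep=None, alpha=1.5):
    """Entmax attention output over all tokens, or only over `keep`."""
    idx = list(range(len(scores))) if keep is None else sorted(keep)
    p = entmax([scores[i] for i in idx], alpha)
    return sum(pi * values[i] for pi, i in zip(p, idx))


def dropped_mass(p_full, keep):
    """delta: full-attention probability mass outside the kept set."""
    return sum(p for i, p in enumerate(p_full) if i not in keep)


# Toy attention scores (query-key logits) and scalar values.
scores = [2.0, 1.5, 0.2, -1.0, -3.0]
values = [1.0, -2.0, 4.0, 0.5, 3.0]

p_full = entmax(scores)
support = [i for i, p in enumerate(p_full) if p > 0.0]
out_full = attend(scores, values)

# Case A: the kept set contains the whole support, so nothing is lost.
keep_a = {0, 1, 3}
delta_a = dropped_mass(p_full, keep_a)
err_a = abs(attend(scores, values, keep_a) - out_full)

# Case B: the kept set misses a support token, so delta > 0.
keep_b = {0, 2, 3}
delta_b = dropped_mass(p_full, keep_b)
err_b = abs(attend(scores, values, keep_b) - out_full)

print("support:", support)
print("case A: delta =", delta_a, "error =", err_a)
print("case B: delta =", delta_b, "error =", err_b)
```

In case A the dropped mass is zero and the sparse output agrees with the full output up to floating-point error, while in case B a support token is missing and both the dropped mass and the output error are nonzero. With softmax in place of entmax, every token would carry nonzero probability, so any truncation would give a positive dropped mass.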

Related papers