Fugu-MT Paper Translation (Abstract): vAttention: Verified Sparse Attention

Paper overview: vAttention: Verified Sparse Attention

arxiv url: http://arxiv.org/abs/2510.05688v1
Date: Tue, 07 Oct 2025 08:46:08 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.166904
Title: vAttention: Verified Sparse Attention
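Title (reference translation): vAttention: Verification of Sparse Attention
Authors: Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
Abstract summary: vAttention is a practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). The authors show that vAttention significantly improves the quality of sparse attention across datasets. It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
Reference score (independently computed level of interest): 100.98210818821688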
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
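Abstract (reference translation): State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and the recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they cannot provide consistent approximations across heads and query vectors and, most importantly, they lack guarantees on approximation quality, which limits practical deployment. Top-$k$ and random sampling are complementary: top-$k$ works well when attention scores are dominated by a few tokens, whereas random sampling gives better estimates when attention scores are relatively uniform. Building on this insight and exploiting the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees (thus, verified). These guarantees make vAttention an attractive step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually and offers a better quality-efficiency trade-off. Experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and Deepseek-R1-Distill-Llama-8B on RULER-HARD). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K generated tokens). The code is available at https://github.com/xAlg-ai/sparse-attention-hub.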
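
Illustrative sketch (not from the paper or its repository): the abstract's observation that top-$k$ and random sampling are complementary can be made concrete with a toy estimator that handles the highest-scoring tokens exactly and estimates the contribution of the remaining tokens from a uniform random sample. Everything below is an assumption made for illustration: the function names `full_attention` and `hybrid_sparse_attention`, the parameters `k_top` and `n_samples`, and the use of an oracle that sees every score in order to pick the exact top-$k$. A real sparse method would use an approximate top-$k$, and this sketch does not implement the $(\epsilon, \delta)$ budget selection that the abstract attributes to vAttention.

```python
import math
import random


def _weights(q, K):
    """Scores s_i = <q, k_i> / sqrt(d) and unnormalized weights exp(s_i - max_s)."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    top_score = max(scores)
    return scores, [math.exp(s - top_score) for s in scores]


def _weighted_sum(indices, w, V, scale=1.0):
    """Return (scale * sum_i w_i * v_i, scale * sum_i w_i) over the given indices."""
    num = [0.0] * len(V[0])
    den = 0.0
    for i in indices:
        den += w[i]
        for j, v in enumerate(V[i]):
            num[j] += w[i] * v
    return [scale * x for x in num], scale * den


def full_attention(q, K, V):
    """Dense softmax attention for a single query vector."""
    _, w = _weights(q, K)
    num, den = _weighted_sum(range(len(K)), w, V)
    return [x / den for x in num]


def hybrid_sparse_attention(q, K, V, k_top, n_samples, rng):
    """Toy hybrid: exact top-k tokens plus a sampled estimate of the rest."""
    if k_top < 1:
        raise ValueError("k_top must be at least 1 in this sketch")
    # Assumption: an oracle computes every score so the exact top-k is known.
    scores, w = _weights(q, K)
    order = sorted(range(len(K)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k_top], order[k_top:]

    # Heavy hitters contribute exactly to numerator and denominator.
    num, den = _weighted_sum(top, w, V)

    # The residual is estimated from a uniform sample drawn without
    # replacement, rescaled by (residual size / sample size).
    m = min(n_samples, len(rest))
    if m > 0:
        sample = rng.sample(rest, m)
        r_num, r_den = _weighted_sum(sample, w, V, scale=len(rest) / m)
        num = [a + b for a, b in zip(num, r_num)]
        den += r_den
    return [x / den for x in num]
```

With `n_samples=0` the sketch reduces to plain top-$k$ attention, and when `n_samples` covers the whole residual it reproduces dense attention up to floating-point rounding. In between, the sampled term matters most when the scores are relatively flat, which is the regime the abstract describes as favorable to sampling.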

Related papers