Fugu-MT paper translation (abstract): Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Paper abstract: Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

arxiv url: http://arxiv.org/abs/2602.07397v1
Date: Sat, 07 Feb 2026 06:27:11 GMT
Status: Translation complete
Last updated (system): 2026-02-10 20:26:24.601641
Title: Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference
Title (reference translation): Sketch-and-Walk Sparse Attention for Efficient LLM Inference
Authors: Hoang Anh Duy Le, Sahil Joshi, Zeyu Yang, Zhaozhuo Xu, Anshumali Shrivastava
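Abstract summary: Introduces Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. It slightly outperforms dense attention in some settings while achieving up to 6x inference speedup.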
Reference score (attention level computed independently by the site): 34.96871737819456
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Self-attention dominates the computational and memory cost of long-context LLM inference across both prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and deterministic walk. Sketch&Walk applies Hadamard sketching to get inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.
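Abstract (reference translation): Self-attention dominates the computational and memory cost of long-context LLM inference across both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers through a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select the top-k attention blocks, and a single training-free algorithm, together with custom sparse attention kernels, applies uniformly to both prefill and decode. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density, slightly outperforms dense attention in some settings, and achieves up to 6x inference speedup.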
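
The abstract describes two of the steps concretely enough to illustrate: estimating attention scores cheaply with a Hadamard sketch, and keeping only the top-k key blocks. The following is a minimal pure-Python sketch of those two steps, not the authors' implementation. Several parts are assumptions made for illustration: the sketch is a standard subsampled randomized Hadamard transform, each key block is summarized by its mean vector, and the names `fwht`, `make_sketcher`, and `select_key_blocks` are hypothetical. The cross-layer walk aggregation and the custom sparse kernels are omitted because the abstract does not specify them.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def make_sketcher(dim, sketch_dim, seed=0):
    """Build a subsampled randomized Hadamard sketch from dim to sketch_dim.

    Assumption: random sign flips, a Hadamard rotation, then coordinate
    subsampling. The paper's exact sketch construction may differ.
    """
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    coords = rng.sample(range(dim), sketch_dim)
    scale = math.sqrt(dim / sketch_dim)  # keeps dot products unbiased

    def sketch(vec):
        rotated = fwht([s * x for s, x in zip(signs, vec)])
        return [scale * rotated[c] for c in coords]

    return sketch


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def select_key_blocks(query, keys, block_size, top_k, sketch):
    """Return sorted indices of the top_k key blocks by sketched score.

    Assumption: each block is represented by the mean of its key vectors.
    """
    sketched_query = sketch(query)
    scores = []
    for start in range(0, len(keys), block_size):
        block_summary = mean_vector(keys[start:start + block_size])
        sketched_block = sketch(block_summary)
        scores.append(sum(a * b for a, b in zip(sketched_query, sketched_block)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

With `sketch_dim` equal to `dim` the transform is orthogonal, so sketched dot products match the exact ones up to rounding. Choosing a smaller `sketch_dim` trades accuracy for cost, and `top_k` relative to the number of blocks plays the role of the attention density mentioned in the abstract.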
