Fugu-MT Paper Translation (Abstract): Sparser Block-Sparse Attention via Token Permutation

Paper overview: Sparser Block-Sparse Attention via Token Permutation

arxiv url: http://arxiv.org/abs/2510.21270v1
Date: Fri, 24 Oct 2025 09:11:50 GMT
Status: Translation complete
Last updated in system: 2025-10-28 02:52:26.961267
Title: Sparser Block-Sparse Attention via Token Permutation
Authors: Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu
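Abstract summary: This paper proposes Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity. Powered by custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling.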
Reference score (independently computed interest level): 46.22204775916057
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code is available at https://github.com/xinghaow99/pbs-attn
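The abstract refers to the "permutation properties of attention" without spelling them out. The sketch below is an illustration only, not the authors' implementation. It checks one standard property of non-causal softmax attention: reordering the key/value pairs together leaves the output unchanged. It then uses a toy count to show how moving scattered important keys next to each other reduces the number of key blocks a query block has to touch. The helper names (`attention`, `blocks_needed`), the block size, and the choice of "important" key indices are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Plain (non-causal) scaled dot-product attention on lists of row vectors."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out


def blocks_needed(positions, block_size):
    """Number of distinct key blocks containing at least one of the given positions."""
    return len({p // block_size for p in positions})


random.seed(0)
n, d = 8, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

# Property: permuting keys and values with the same permutation
# does not change the attention output (up to floating-point rounding).
perm = list(range(n))
random.shuffle(perm)
K_p = [K[i] for i in perm]
V_p = [V[i] for i in perm]
out = attention(Q, K, V)
out_p = attention(Q, K_p, V_p)
max_diff = max(abs(a - b) for ra, rb in zip(out, out_p) for a, b in zip(ra, rb))

# Toy block count: 16 keys, block size 4, and four hypothetical
# "important" keys scattered so that each sits in a different block.
block_size = 4
important = [0, 5, 10, 15]
before = blocks_needed(important, block_size)

# Reorder the keys so the important ones come first, then recount.
order = important + [i for i in range(16) if i not in important]
new_positions = [order.index(i) for i in important]
after = blocks_needed(new_positions, block_size)
```

In this toy setting the important keys span four blocks before reordering and one block afterwards, while the first check confirms that such a reordering would not alter the full-attention result. With a causal mask, reordering keys changes which keys each query may see, so a practical scheme has to account for that; the abstract does not describe how PBS-Attn handles this.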

Related papers