Fugu-MT Paper Translation (Summary): CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

Paper overview: CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

arxiv url: http://arxiv.org/abs/2604.08584v1
Date: Mon, 30 Mar 2026 01:42:29 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.461869
Title: CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference
Authors: Chuxu Song, Zhencan Peng, Jiuqi Wei, Chuanhui Yang
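Abstract summary: CSAttention (Centroid-Scoring Attention) is a training-free sparse attention method optimized for high-throughput serving of reusable contexts. It front-loads computation into a one-time offline prefill phase whose cost can be amortized across multiple queries. It consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed.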
Reference score (independently computed notability): 3.1255988998610307
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context LLMs increasingly rely on extended, reusable prefill prompts for agents and domain Q&A, pushing attention and KV-cache to become the dominant decode-time bottlenecks. While sparse attention reduces computation and transfer costs, it often struggles to maintain accuracy at high sparsity levels due to the inherent distribution shift between Queries and Keys. We propose Centroid-Scoring Attention (CSAttention), a training-free sparse attention method optimized for high-throughput serving of reusable contexts. CSAttention adopts a storage-for-computation strategy tailored to the offline-prefill/online-decode setting: it front-loads computation into a one-time offline prefill phase that can be amortized across multiple queries, while aggressively optimizing per-step decoding latency. Specifically, CSAttention constructs query-centric lookup tables during offline prefill, whose size remains fixed during decoding, and enables online decoding to replace full-context scans with efficient table lookups and GPU-friendly score accumulation. Extensive experiments demonstrate that CSAttention achieves near-identical accuracy to full attention. Under high sparsity (95%) and long-context settings (32K-128K), CSAttention consistently outperforms state-of-the-art sparse attention methods in both model accuracy and inference speed, achieving up to 4.6x inference speedup over the most accurate baseline at a context length of 128K.
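The abstract describes the offline-prefill/online-decode split only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a plain k-means clustering of the cached keys as the offline step and a centroid dot-product ranking as the online step; the paper's query-centric lookup tables are not specified here and are not reproduced. All function names (`token_budget`, `build_centroid_table`, `sparse_decode`) are hypothetical, and 128K is taken to mean 131072 tokens, so that 95% sparsity leaves roughly 6,554 attended tokens.

```python
import math
import random


def token_budget(context_len, sparsity):
    """Number of tokens still attended to at a given sparsity level."""
    return max(1, round(context_len * (1.0 - sparsity)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(query, keys, values):
    """Plain softmax attention for a single query vector."""
    scores = [dot(query, k) for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def assign(keys, centroids):
    """Group key indices by their nearest centroid (squared L2 distance)."""
    members = [[] for _ in centroids]
    for i, k in enumerate(keys):
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(k, centroids[c])),
        )
        members[nearest].append(i)
    return members


def build_centroid_table(keys, num_clusters, iters=10, seed=0):
    """Offline step (assumed, not from the paper): k-means over cached keys."""
    rng = random.Random(seed)
    centroids = [list(k) for k in rng.sample(keys, num_clusters)]
    for _ in range(iters):
        members = assign(keys, centroids)
        for c, idx in enumerate(members):
            if idx:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(keys[i][d] for i in idx) / len(idx)
                                for d in range(len(keys[0]))]
    return centroids, assign(keys, centroids)


def sparse_decode(query, keys, values, centroids, members, budget):
    """Online step (assumed): rank centroids, then attend to a few keys only."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    selected = []
    for c in ranked:
        selected.extend(members[c])
        if len(selected) >= budget:
            break
    selected = selected[:budget]  # trim the last cluster to fit the budget
    out = attention(query, [keys[i] for i in selected],
                    [values[i] for i in selected])
    return out, selected
```

In this sketch the centroid table is built once per reusable context and its size does not change during decoding. Each decode step then scores `num_clusters` centroids instead of scanning every cached key, and the `budget` argument (for example `token_budget(131072, 0.95)`) controls how many keys enter the softmax.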
