Fugu-MT Paper Translation (Abstract): SALS: Sparse Attention in Latent Space for KV cache Compression

Paper summary: SALS: Sparse Attention in Latent Space for KV cache Compression

arxiv url: http://arxiv.org/abs/2510.24273v1
Date: Tue, 28 Oct 2025 10:32:52 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:37.021476
Title: SALS: Sparse Attention in Latent Space for KV cache Compression
Title (reference translation): SALS: Sparse Attention in Latent Space for KV Cache Compression
Authors: Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li
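Abstract summary: This paper introduces two key insights: first, applying RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space (SALS) framework. SALS projects the KV cache into a compact latent space via low-rank projection and performs sparse token selection in this space using RoPE-free query-key interactions.
Reference score (independently calculated attention level): 17.28816246273855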
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For the end-to-end throughput performance, we achieve 1.4-fold and 4.5-fold improvement compared to GPT-fast on 4K and 32K sequences, respectively.
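Abstract (reference translation): Large language models that can handle extended contexts are in high demand, but inference remains difficult because of the large key-value cache size and the high memory bandwidth requirements. Previous work has shown that the KV cache exhibits low-rank characteristics within the hidden dimension, which suggests the potential for effective compression. However, because of the Rotary Position Embedding mechanism widely adopted in modern LLMs, naive low-rank compression suffers accuracy degradation or creates a new speed bottleneck. This paper introduces two key insights: first, applying RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space framework. SALS projects the KV cache into a compact latent space via low-rank projection and performs sparse token selection in this space using RoPE-free query-key interactions. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS using two large-scale models, LLaMA2-7b-chat and Mistral-7b, and verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results show that SALS achieves SOTA performance while maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and a 5.7-fold speed-up compared to FlashAttention2 on 4K sequences. For end-to-end throughput, it achieves 1.4-fold and 4.5-fold improvements over GPT-fast on 4K and 32K sequences, respectively.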
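
The pipeline described in the abstract (low-rank projection of the cache, RoPE-free token selection in the latent space, reconstruction of only the selected tokens) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the SVD-based projection, the rotate-half RoPE convention, the choice to leave values uncompressed, and the synthetic exactly-low-rank keys are all assumptions made for illustration.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Rotate-half RoPE for rows of x with shape (n, d), d even."""
    half = x.shape[1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=float)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def fit_projection(keys_pre_rope, rank):
    """Top-`rank` right singular vectors of the pre-RoPE keys, shape (d, rank)."""
    _, _, vt = np.linalg.svd(keys_pre_rope, full_matrices=False)
    return vt[:rank].T


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def latent_sparse_attention(q_pre, q_pos, latent_keys, values, proj, top_k):
    """One decode step: select tokens in latent space, reconstruct only those."""
    d = q_pre.shape[0]
    # 1) RoPE-free scoring in the latent space (n * rank work, not n * d).
    scores = latent_keys @ (proj.T @ q_pre)
    # 2) Keep the top_k highest-scoring cached tokens.
    idx = np.argsort(scores)[-top_k:]
    # 3) Reconstruct only the selected keys, then apply RoPE at their positions
    #    (assumption: cached token i sits at position i).
    k_sel = rope(latent_keys[idx] @ proj.T, idx)
    q_rot = rope(q_pre[None, :], [q_pos])[0]
    # 4) Standard scaled dot-product attention over the selected subset.
    weights = softmax(k_sel @ q_rot / np.sqrt(d))
    return weights @ values[idx]


def dense_attention(q_pre, q_pos, keys_pre, values):
    """Reference: full attention with RoPE applied to every key."""
    d = q_pre.shape[0]
    k_rot = rope(keys_pre, np.arange(len(keys_pre)))
    q_rot = rope(q_pre[None, :], [q_pos])[0]
    return softmax(k_rot @ q_rot / np.sqrt(d)) @ values


rng = np.random.default_rng(0)
n, d, r = 256, 64, 8
# Synthetic keys that are exactly rank-r before RoPE (an idealisation).
keys_pre = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
values = rng.standard_normal((n, d))
query = rng.standard_normal(d)

proj = fit_projection(keys_pre, r)
latent_keys = keys_pre @ proj  # (n, r): what the compressed key cache would store
sparse_out = latent_sparse_attention(query, n, latent_keys, values, proj, top_k=32)
full_out = latent_sparse_attention(query, n, latent_keys, values, proj, top_k=n)
```

In this toy setting the key cache shrinks from `n * d` to `n * r` numbers, and RoPE is applied only to the `top_k` reconstructed keys rather than to the whole cache. Because the synthetic keys are exactly rank `r`, selecting every token (`top_k=n`) reproduces dense attention; with real model activations the projection is lossy and the selection is approximate, so accuracy depends on the rank and on `top_k`.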

Related papers list