Fugu-MT Paper Translation (Abstract): SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Paper overview: SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

arxiv url: http://arxiv.org/abs/2508.15212v1
Date: Thu, 21 Aug 2025 03:48:28 GMT
Status: Translation complete
Last updated in system: 2025-08-22 16:26:46.169188
Title: SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Authors: Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu
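Abstract summary: Long-context inference in large language models is increasingly constrained by the KV cache bottleneck. We propose SPARK, a training-free, plug-and-play method that applies unstructured sparsity by pruning KV at the channel level. SPARK reduces channel-level redundancy, enabling longer sequences to be processed within the same memory budget.
Reference score (independently computed attention metric): 26.26715997974707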
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis, using strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained variations in importance across feature dimensions (i.e., the channel axis), which limits their ability to balance efficiency and model accuracy. We observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free, plug-and-play method that applies unstructured sparsity by pruning KV at the channel level and dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques and can be combined with them for further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK keeps performance degradation below 5% relative to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
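The abstract describes the mechanism only at a high level, so the following is a minimal pure-Python sketch of the general idea rather than the authors' implementation. It assumes, purely for illustration, that channel saliency is the magnitude |q_c * k_c| and that pruned entries are "restored" by substituting a per-channel fill value (zero, or the mean over cached keys). The function names `prune_key_channels`, `channel_means`, and `attention_scores` are hypothetical and do not come from the SparK repository.

```python
def prune_key_channels(keys, query, keep_ratio):
    """Per cached key, keep only the channels with the largest |q_c * k_c|.

    ASSUMPTION: this saliency measure is illustrative; the abstract does not
    specify how SPARK scores channels. The dict layout is for readability
    only and does not itself save memory.
    """
    dim = len(query)
    n_keep = max(1, round(dim * keep_ratio))
    pruned = []
    for key in keys:
        saliency = [abs(q * k) for q, k in zip(query, key)]
        kept = sorted(range(dim), key=lambda c: saliency[c], reverse=True)[:n_keep]
        # Unstructured sparsity: each token may keep a different channel set.
        pruned.append({c: key[c] for c in kept})
    return pruned


def channel_means(keys):
    """Per-channel mean over all cached keys (one hypothetical fill value)."""
    dim = len(keys[0])
    return [sum(key[c] for key in keys) / len(keys) for c in range(dim)]


def attention_scores(pruned_keys, query, fill):
    """Scaled dot-product scores, substituting fill[c] for pruned channels.

    ASSUMPTION: filling with a fixed per-channel value stands in for the
    "dynamic restoring" step mentioned in the abstract.
    """
    scale = len(query) ** 0.5
    return [
        sum(q * sparse_key.get(c, fill[c]) for c, q in enumerate(query)) / scale
        for sparse_key in pruned_keys
    ]


# Toy data: two cached keys with four channels each, and one query.
keys = [[1.0, -2.0, 0.25, 0.0], [0.1, 0.2, 3.0, -1.0]]
query = [1.0, 1.0, 2.0, 0.5]

pruned = prune_key_channels(keys, query, keep_ratio=0.5)
print("kept channels per key:", [sorted(p) for p in pruned])
print("scores, zero fill:", attention_scores(pruned, query, [0.0] * 4))
print("scores, mean fill:", attention_scores(pruned, query, channel_means(keys)))
```

An 80% pruning ratio, as quoted in the abstract, would correspond to `keep_ratio=0.2` in this sketch. A practical implementation would store the kept values and their channel indices in a compact tensor layout instead of Python dictionaries.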


Related papers