Fugu-MT Paper Translation (Abstract): SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

Paper summary: SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size

arxiv url: http://arxiv.org/abs/2510.22556v1
Date: Sun, 26 Oct 2025 07:17:10 GMT
Status: Translation complete
System update date: 2025-10-28 15:28:15.244978
Title: SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
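Title (reference translation): SABlock: Semantic-Aware KV Cache Eviction with Adaptive Compression Block Size
Authors: Jinhan Chen, Jianchun Liu, Hongli Xu, Xianjun Gao, Shilong Wang
Abstract summary: SABlock is a semantic-aware KV cache eviction framework with adaptive block sizes. SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. In experiments on long-context benchmarks, SABlock consistently outperforms state-of-the-art baselines.
Reference score (independently computed attention metric): 20.4175480790854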
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growing memory footprint of the Key-Value (KV) cache poses a severe scalability bottleneck for long-context Large Language Model (LLM) inference. While KV cache eviction has emerged as an effective solution by discarding less critical tokens, existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, we introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks demonstrate that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For instance, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding on a 128K context length.
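Abstract (reference translation): The growing memory footprint of the key-value (KV) cache creates a severe scalability bottleneck for long-context Large Language Model (LLM) inference. KV cache eviction has emerged as an effective solution that discards less important tokens, but existing token-, block-, and sentence-level compression methods struggle to balance semantic coherence and memory efficiency. To this end, the authors introduce SABlock, a semantic-aware KV cache eviction framework with adaptive block sizes. Specifically, SABlock first performs semantic segmentation to align compression boundaries with linguistic structures, then applies segment-guided token scoring to refine token importance estimation. Finally, for each segment, a budget-driven search strategy adaptively determines the optimal block size that preserves semantic integrity while improving compression efficiency under a given cache budget. Extensive experiments on long-context benchmarks show that SABlock consistently outperforms state-of-the-art baselines under the same memory budgets. For example, on Needle-in-a-Haystack (NIAH), SABlock achieves 99.9% retrieval accuracy with only 96 KV entries, nearly matching the performance of the full-cache baseline that retains up to 8K entries. Under a fixed cache budget of 1,024, SABlock further reduces peak memory usage by 46.28% and achieves up to 9.5x faster decoding at a 128K context length.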
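
Illustrative sketch (not from the paper): the abstract names three stages, namely semantic segmentation, segment-guided token scoring, and a budget-driven search over block sizes, but gives no formulas. The Python below is one possible reading under stated assumptions. Segments are cut at punctuation tokens, each token score is blended with its segment mean using a hypothetical weight `alpha`, the per-segment budget is proportional to segment length, and the search picks the largest candidate block size whose kept score mass stays within a hypothetical `fidelity` fraction of token-level top-k selection. All function names, parameters, and thresholds are assumptions made for illustration.

```python
def segment(tokens, delimiters=(".", "!", "?", ",", ";", "\n")):
    """Split token positions into half-open (start, end) segments that end at a delimiter."""
    segments, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in delimiters:
            segments.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        segments.append((start, len(tokens)))
    return segments


def refine_scores(scores, segments, alpha=0.5):
    """Assumed scoring rule: blend each token score with the mean score of its segment."""
    refined = list(scores)
    for s, e in segments:
        mean = sum(scores[s:e]) / (e - s)
        for i in range(s, e):
            refined[i] = alpha * scores[i] + (1 - alpha) * mean
    return refined


def select_blocks(scores, start, end, budget, block_size):
    """Keep the highest-scoring fixed-size blocks of a segment without exceeding the budget."""
    blocks = [list(range(b, min(b + block_size, end)))
              for b in range(start, end, block_size)]
    blocks.sort(key=lambda blk: sum(scores[i] for i in blk) / len(blk), reverse=True)
    kept = []
    for blk in blocks:
        if len(kept) + len(blk) <= budget:
            kept.extend(blk)
    return sorted(kept)


def choose_block_size(scores, start, end, budget, candidates=(8, 4, 2, 1), fidelity=0.9):
    """Assumed search rule: largest block size that keeps >= fidelity of the top-k score mass."""
    best_mass = sum(sorted(scores[start:end], reverse=True)[:budget])
    for size in sorted(candidates, reverse=True):
        kept = select_blocks(scores, start, end, budget, size)
        if sum(scores[i] for i in kept) >= fidelity * best_mass:
            return size, kept
    # Fall back to token-level selection if no candidate met the threshold.
    return 1, select_blocks(scores, start, end, budget, 1)


def evict(tokens, scores, budget):
    """Return the sorted token positions to keep in the cache under a total budget."""
    segments = segment(tokens)
    refined = refine_scores(scores, segments)
    kept = []
    for s, e in segments:
        seg_budget = budget * (e - s) // len(tokens)  # length-proportional share (assumption)
        _, idx = choose_block_size(refined, s, e, seg_budget)
        kept.extend(idx)
    return kept
```

In this sketch, `scores` stands in for whatever per-token importance signal is available (for example, accumulated attention weights). Larger blocks are preferred whenever they lose little score mass, which mirrors the abstract's stated goal of preserving semantic integrity while improving compression efficiency, though the paper's actual criterion may differ.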


Related papers