Fugu-MT Paper Translation (Abstract): IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

Paper abstract: IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

arxiv url: http://arxiv.org/abs/2604.10539v1
Date: Sun, 12 Apr 2026 09:02:20 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.084914
Title: IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs
Title (reference translation): IceCache: Memory-Efficient KV Cache Management for Long-Sequence LLMs
Authors: Yuzhen Mao, Qitong Wang, Martin Ester, Ke Li
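Abstract summary: The key-value (KV) cache plays an important role in accelerating inference in large language models. This paper proposes a new KV cache management strategy that integrates semantic token clustering with PagedAttention. With a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model.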
Reference score (independently computed interest level): 12.353502602473695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Key-Value (KV) cache plays a crucial role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, often leading to severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading KV cache to the CPU while retaining only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. In this paper, we propose a novel KV cache management strategy, IceCache, which integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, our method enables more efficient token selection and better utilization of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on our project website at https://yuzhenmao.github.io/IceCache/.
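Abstract (reference translation): The key-value (KV) cache plays an important role in accelerating inference in large language models (LLMs) by storing intermediate attention states and avoiding redundant computation during autoregressive generation. However, its memory footprint scales linearly with sequence length, which often causes severe memory bottlenecks on resource-constrained hardware. Prior work has explored offloading the KV cache to the CPU while keeping only a subset on the GPU, but these approaches often rely on imprecise token selection and suffer performance degradation in long-generation tasks such as chain-of-thought reasoning. This paper proposes IceCache, a new KV cache management strategy that integrates semantic token clustering with PagedAttention. By organizing semantically related tokens into contiguous memory regions managed by a hierarchical, dynamically updatable data structure, the method enables more efficient token selection and better use of memory bandwidth during CPU-GPU transfers. Experimental results on LongBench show that, with a 256-token budget, IceCache maintains 99% of the original accuracy achieved by the full KV cache model. Moreover, compared to other offloading-based methods, IceCache attains competitive or even superior latency and accuracy while using only 25% of the KV cache token budget, demonstrating its effectiveness in long-sequence scenarios. The code is available on the project website at https://yuzhenmao.github.io/IceCache/.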
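The idea in the abstract of grouping semantically related tokens into pages and then choosing whole pages under a token budget can be illustrated with a small sketch. This is not IceCache's implementation: the function names (`cluster_into_pages`, `select_tokens`) are hypothetical, plain k-means-style clustering is swapped in for whatever clustering the paper uses, pages are lists of token indices standing in for contiguous memory regions, and the hierarchical structure and CPU-GPU transfers are left out entirely.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> List[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_into_pages(keys: List[Vector], num_pages: int, iters: int = 5) -> List[List[int]]:
    """Group token indices into pages with a few k-means-style rounds.

    Assumption: plain k-means stands in for the paper's semantic clustering.
    """
    num_pages = min(num_pages, len(keys))
    # Seed centroids with evenly spaced keys so the result is deterministic.
    step = max(1, len(keys) // num_pages)
    centroids = [list(keys[i * step]) for i in range(num_pages)]
    pages: List[List[int]] = []
    for _ in range(iters):
        pages = [[] for _ in range(num_pages)]
        for idx, key in enumerate(keys):
            nearest = min(range(num_pages), key=lambda p: sq_dist(key, centroids[p]))
            pages[nearest].append(idx)
        # Recompute centroids; keep the previous one if a page ended up empty.
        centroids = [
            mean([keys[i] for i in page]) if page else centroids[p]
            for p, page in enumerate(pages)
        ]
    return pages


def select_tokens(query: Vector, keys: List[Vector], pages: List[List[int]], budget: int) -> List[int]:
    """Pick whole pages, best centroid-query score first, within the token budget."""
    non_empty = [page for page in pages if page]
    ranked = sorted(
        non_empty,
        key=lambda page: dot(query, mean([keys[i] for i in page])),
        reverse=True,
    )
    selected: List[int] = []
    for page in ranked:
        if len(selected) + len(page) > budget:
            break  # the next-best page no longer fits in the budget
        selected.extend(page)
    return sorted(selected)


# Toy key vectors: tokens 0, 1, 4 point one way, tokens 2, 3, 5 the other.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.0, 0.8]]
pages = cluster_into_pages(keys, num_pages=2)
chosen = select_tokens([0.0, 1.0], keys, pages, budget=3)
print(pages, chosen)
```

Scoring one centroid per page rather than every cached key is what makes selection cheap in this sketch, and fetching a whole page at once mirrors the abstract's point about contiguous regions and transfer bandwidth.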

Related papers