Fugu-MT Paper Translation (Abstract): EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving

Paper overview: EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving

arxiv url: http://arxiv.org/abs/2512.14946v1
Date: Tue, 16 Dec 2025 22:21:55 GMT
Status: Translation complete
Last updated in system: 2025-12-18 17:06:26.800108
Title: EVICPRESS: Joint KV-Cache Compression and Eviction for Efficient LLM Serving
Authors: Shaoting Feng, Yuhan Liu, Hanchen Li, Xiaokun Chen, Samuel Shen, Kuntai Du, Zhuohan Gu, Rui Zhang, Yuyang Huang, Yihua Cheng, Jiayi Yao, Qizheng Zhang, Ganesh Ananthanarayanan, Junchen Jiang
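Abstract summary: Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. Prior work has proposed either evicting KV cache to lower-tier storage devices or compressing KV cache so that more of it fits in fast memory. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers.
Reference score (independently computed attention level): 27.616284276071855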
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reusing KV cache is essential for high efficiency of Large Language Model (LLM) inference systems. With more LLM users, the KV cache footprint can easily exceed GPU memory capacity, so prior work has proposed to either evict KV cache to lower-tier storage devices, or compress KV cache so that more KV cache can be fit in the fast memory. However, prior work misses an important opportunity: jointly optimizing the eviction and compression decisions across all KV caches to minimize average generation latency without hurting quality. We propose EVICPRESS, a KV-cache management system that applies lossy compression and adaptive eviction to KV cache across multiple storage tiers. Specifically, for each KV cache of a context, EVICPRESS considers the effect of compression and eviction of the KV cache on the average generation quality and delay across all contexts as a whole. To achieve this, EVICPRESS proposes a unified utility function that quantifies the effect of quality and delay of the lossy compression or eviction. To this end, EVICPRESS's profiling module periodically updates the utility function scores on all possible eviction-compression configurations for all contexts and places KV caches using a fast heuristic to rearrange KV caches on all storage tiers, with the goal of maximizing the utility function scores on each storage tier. Compared to the baselines that evict KV cache or compress KV cache, EVICPRESS achieves higher KV-cache hit rates on fast devices, i.e., lower delay, while preserving high generation quality by applying conservative compression to contexts that are sensitive to compression errors. Evaluation on 12 datasets and 5 models demonstrates that EVICPRESS achieves up to 2.19x faster time-to-first-token (TTFT) at equivalent generation quality.
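The joint decision described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not give the form of the utility function or the placement heuristic, so the linear score `quality - alpha * delay_s`, the `Option` record, the `place` function, and the "largest regret first" greedy order are all hypothetical and are not EVICPRESS's actual method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One hypothetical (tier, compression) choice for a context's KV cache."""
    tier: str         # storage tier name, e.g. "gpu" or "disk"
    compressed: bool  # whether lossy compression is applied
    size_gb: float    # footprint on that tier
    quality: float    # profiled generation quality, 0..1 (assumed scale)
    delay_s: float    # profiled loading delay in seconds


def utility(opt, alpha=1.0):
    # Assumed form: reward quality, penalize delay. The real utility
    # function is not specified in the abstract.
    return opt.quality - alpha * opt.delay_s


def place(options_by_context, capacity_gb, alpha=1.0):
    """Greedily give each context its best-scoring option that still fits.

    Contexts with the largest gap between their best and worst option are
    placed first. A context for which nothing fits is left out of the result.
    """
    free = dict(capacity_gb)
    placement = {}

    def regret(ctx):
        scores = [utility(o, alpha) for o in options_by_context[ctx]]
        return max(scores) - min(scores)

    for ctx in sorted(options_by_context, key=regret, reverse=True):
        ranked = sorted(options_by_context[ctx],
                        key=lambda o: utility(o, alpha), reverse=True)
        for opt in ranked:
            if opt.size_gb <= free[opt.tier]:
                free[opt.tier] -= opt.size_gb
                placement[ctx] = opt
                break
    return placement


# Made-up profile: "sensitive" loses a lot of quality when compressed,
# "tolerant" loses very little.
options = {
    "sensitive": [Option("gpu", False, 1.0, 1.00, 0.10),
                  Option("gpu", True, 0.5, 0.60, 0.05),
                  Option("disk", False, 1.0, 1.00, 1.00)],
    "tolerant": [Option("gpu", False, 1.0, 1.00, 0.10),
                 Option("gpu", True, 0.5, 0.98, 0.05),
                 Option("disk", False, 1.0, 1.00, 1.00)],
}
placement = place(options, {"gpu": 1.5, "disk": float("inf")})
```

With these made-up numbers, both contexts stay on the fast tier: the compression-tolerant one is compressed and the sensitive one is kept uncompressed. An eviction-only policy would have to push one of them to disk, and a compress-everything policy would degrade the sensitive context.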
