Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
- URL: http://arxiv.org/abs/2507.08045v1
- Date: Thu, 10 Jul 2025 01:51:17 GMT
- Title: Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
- Authors: Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Zibin Zheng
- Abstract summary: We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage.
- Score: 24.159793132911954
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator that mitigates the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler that reduces potential bubbles caused by the imbalance between the recomputing and loading streams due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
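The abstract's core idea, choosing which layer pairs to compress per conversation based on attention similarity, can be illustrated with a small sketch. This is not Krul's actual selector: the cosine metric, the greedy non-overlapping pairing rule, the `threshold` value, the function names, and the toy attention vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length attention vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_layer_pairs(attn_by_layer, threshold):
    """Greedily pick non-overlapping adjacent layer pairs (i, i + 1)
    whose attention vectors have similarity >= threshold.

    attn_by_layer: one attention-weight vector per layer (hypothetical
    per-conversation summary; a real system would derive it from the model).
    """
    pairs = []
    i = 0
    while i < len(attn_by_layer) - 1:
        sim = cosine_similarity(attn_by_layer[i], attn_by_layer[i + 1])
        if sim >= threshold:
            pairs.append((i, i + 1))
            i += 2  # a layer takes part in at most one pair
        else:
            i += 1
    return pairs


# Toy data: two conversations over the same 4-layer model.
conv_a = [[0.7, 0.2, 0.1], [0.68, 0.22, 0.1], [0.1, 0.1, 0.8], [0.3, 0.4, 0.3]]
conv_b = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.12, 0.2, 0.68], [0.3, 0.4, 0.3]]

print(select_layer_pairs(conv_a, threshold=0.95))  # [(0, 1)]
print(select_layer_pairs(conv_b, threshold=0.95))  # [(1, 2)]
```

The two toy conversations yield different layer pairs, which is the situation the abstract says a fixed, conversation-independent compression scheme overlooks.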
Related papers
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
- FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management [27.734106884226005]
FlowKV is a novel multi-turn isolation mechanism for KV cache management. It preserves the accumulated compressed KV cache from past turns. It prevents the re-compression of older context, thereby mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-05-21T10:20:46Z)
- KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [16.53643930310808]
KeepKV is a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. We show that KeepKV substantially reduces memory usage, enhances inference throughput by more than 2x and keeps superior generation quality even with 10% KV cache budgets.
arXiv Detail & Related papers (2025-04-14T06:58:00Z)
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
- EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance [44.14919492126948]
As memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. We propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
arXiv Detail & Related papers (2024-12-11T16:35:13Z)
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency. We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
PoD allocates memory according to token importance. PoD reduces KV cache memory usage by up to 35% without compromising performance.
arXiv Detail & Related papers (2024-12-03T08:29:27Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache, and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z)