DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity
- URL: http://arxiv.org/abs/2602.08005v1
- Date: Sun, 08 Feb 2026 15:14:36 GMT
- Title: DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity
- Authors: Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
- Abstract summary: We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
- Score: 50.52392445266824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29\% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2$\times$ throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at https://github.com/CURRENTF/Sparse-vLLM.
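A minimal sketch of the residual-encoding idea described above (my illustration under assumed details, not the authors' implementation): each new KV vector is stored as a reference index into the existing cache plus a coarsely quantized residual against its most similar historical vector.

```python
# Residual-based KV encoding sketch in the spirit of DeltaKV (assumed details,
# not the paper's implementation): store (reference index, quantized residual).
import numpy as np

def encode(new_vec, history, n_bits=4):
    """Return (ref_index, quantized residual, scale) for one KV vector."""
    sims = history @ new_vec / (
        np.linalg.norm(history, axis=1) * np.linalg.norm(new_vec) + 1e-8)
    ref = int(np.argmax(sims))                 # most similar historical vector
    residual = new_vec - history[ref]
    scale = np.abs(residual).max() / (2 ** (n_bits - 1) - 1) + 1e-8
    q = np.round(residual / scale).astype(np.int8)   # small-range residual
    return ref, q, scale

def decode(ref, q, scale, history):
    return history[ref] + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
history = rng.normal(size=(128, 64)).astype(np.float32)   # past KV vectors
new_vec = history[17] + 0.05 * rng.normal(size=64).astype(np.float32)
ref, q, scale = encode(new_vec, history)
rec = decode(ref, q, scale, history)
print(ref, float(np.linalg.norm(rec - new_vec)))          # small reconstruction error
```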
Related papers
- Joint Encoding of KV-Cache Blocks for Scalable LLM Serving [3.3230675313521716]
Existing KV-cache compression methods rely on rigid compression schemes, disrupt tensor layouts, or require specialized compute. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware.
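To make the block-fusion idea concrete, here is a hedged sketch (assumed greedy cosine-similarity fusion; the paper's actual joint encoding scheme may differ) in which near-duplicate KV blocks are mapped onto one shared representative.

```python
# Hedged sketch of joint KV-block encoding: near-duplicate blocks from
# different requests share one stored representative (assumed mechanism).
import numpy as np

def fuse_blocks(blocks, threshold=0.95):
    """Greedy fusion: each block joins an existing representative if cosine
    similarity exceeds `threshold`, otherwise it becomes a new one."""
    reps, assign = [], []
    for b in blocks:
        bn = b / (np.linalg.norm(b) + 1e-8)
        for i, r in enumerate(reps):
            if bn @ (r / (np.linalg.norm(r) + 1e-8)) >= threshold:
                assign.append(i)
                break
        else:
            reps.append(b)
            assign.append(len(reps) - 1)
    return np.stack(reps), assign

rng = np.random.default_rng(1)
base = rng.normal(size=(4, 256))                  # 4 distinct block patterns
blocks = base[rng.integers(0, 4, size=32)] + 0.01 * rng.normal(size=(32, 256))
reps, assign = fuse_blocks(blocks)
print(len(blocks), "blocks stored as", len(reps), "shared representations")
```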
arXiv Detail & Related papers (2026-01-06T14:50:58Z) - KV Cache Transform Coding for Compact Storage in LLM Inference [2.20003167536462]
We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH-500.
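A hedged sketch of transform coding for KV vectors: a decorrelating orthogonal basis is fit offline and the transformed coefficients are quantized. The KLT-style basis and uniform quantizer here are assumptions for illustration; KVTC's actual transform and bit allocation may differ.

```python
# Transform-coding sketch (assumed KLT basis + uniform quantizer, not KVTC's
# exact pipeline): decorrelate KV vectors, then quantize the coefficients.
import numpy as np

rng = np.random.default_rng(2)
# offline calibration set with correlated dimensions (stand-in for real KV data)
calib = rng.normal(size=(1024, 64)) @ rng.normal(size=(64, 64)) * 0.1
mean = calib.mean(0)
_, _, basis = np.linalg.svd(calib - mean, full_matrices=False)   # orthogonal basis

def transform_encode(x, n_bits=4):
    coeff = (x - mean) @ basis.T                       # decorrelating transform
    scale = np.abs(coeff).max() / (2 ** (n_bits - 1) - 1) + 1e-8
    return np.round(coeff / scale).astype(np.int8), scale

def transform_decode(q, scale):
    return (q.astype(np.float32) * scale) @ basis + mean

kv = rng.normal(size=(8, 64))                          # a few cached KV vectors
q, s = transform_encode(kv)
print(float(np.abs(transform_decode(q, s) - kv).mean()))   # quantization-only error
```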
arXiv Detail & Related papers (2025-11-03T18:20:35Z) - KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache [7.019967158501771]
We present KVComp, a generic and efficient KV cache management framework optimized for long-text generation. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics. We show that KVComp achieves on average 47% and up to 83% higher memory reduction rate compared to existing methods.
arXiv Detail & Related papers (2025-08-30T18:25:19Z) - KV-Distill: Nearly Lossless Learnable Context Compression for LLMs [37.0803484148612]
We introduce KV-Distill, a Transformer compression framework that distills long-context KV caches into significantly shorter representations. KV-Distill can be trained as a parameter-efficient adaptor for pretrained models. It can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance.
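As a structural illustration of distilling a KV cache into shorter representations, the sketch below pools every r consecutive KV vectors through a small learned projection (untrained; the module name and layout are assumptions, and the paper's adaptor design and distillation objective are not reproduced).

```python
# Hedged sketch of a learnable KV-cache compressor (structure only, untrained):
# every r consecutive KV vectors are pooled into one learned summary vector.
import torch
import torch.nn as nn

class KVPoolAdaptor(nn.Module):
    def __init__(self, d_head: int, r: int = 8):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(r * d_head, d_head)   # learned pooling of r vectors

    def forward(self, kv: torch.Tensor) -> torch.Tensor:
        # kv: (seq_len, d_head); truncate to a multiple of r for simplicity
        L, d = kv.shape
        L = (L // self.r) * self.r
        groups = kv[:L].reshape(L // self.r, self.r * d)
        return self.proj(groups)                    # (seq_len / r, d_head)

adaptor = KVPoolAdaptor(d_head=64, r=8)
kv = torch.randn(1000, 64)
print(adaptor(kv).shape)   # 8x shorter cache; would be trained against teacher outputs
```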
arXiv Detail & Related papers (2025-03-13T13:15:28Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
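For reference, a hedged sketch of group-wise 4-bit KV quantization (illustrative only; QuantSpec's hierarchical scheme, packing, and speculative-decoding loop are not reproduced here).

```python
# Group-wise asymmetric 4-bit quantization sketch: one (scale, zero) per group.
import numpy as np

def quantize_4bit(x, group=32):
    g = x.reshape(-1, group)
    lo, hi = g.min(axis=1, keepdims=True), g.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8                 # 4 bits -> 16 levels
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo, shape):
    return (q.astype(np.float32) * scale + lo).reshape(shape)

rng = np.random.default_rng(3)
kv = rng.normal(size=(4, 128)).astype(np.float32)
q, s, z = quantize_4bit(kv)
print(float(np.abs(dequantize_4bit(q, s, z, kv.shape) - kv).mean()))
```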
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation [14.33163594016033]
Large language models (LLMs) require substantial prefill computation and key-value (KV) cache memory. Recent works that combine KV cache compression with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. FastKV is a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers.
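A hedged sketch of token-selective propagation: past a chosen layer, only the highest-scoring tokens are kept in the KV cache for deeper layers. The dot-product importance proxy and fixed budget below are assumptions; FastKV's scoring is more involved.

```python
# Token-selective propagation sketch: keep only the top-`budget` tokens
# (by an assumed importance proxy) for deeper layers.
import numpy as np

def select_tokens(keys, values, importance, budget):
    """Keep the `budget` most important tokens, preserving their order."""
    keep = np.sort(np.argsort(importance)[-budget:])
    return keys[keep], values[keep], keep

rng = np.random.default_rng(4)
L, d = 1000, 64
keys, values = rng.normal(size=(L, d)), rng.normal(size=(L, d))
query = rng.normal(size=d)
importance = keys @ query                   # proxy: attention logits of the last query
k_sel, v_sel, idx = select_tokens(keys, values, importance, budget=200)
print(k_sel.shape, "tokens propagated out of", L)
```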
arXiv Detail & Related papers (2025-02-03T05:25:09Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts. ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units. ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
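A hedged sketch of chunk-level retention, my reading of the chunk-as-unit idea: token importance is aggregated over contiguous chunks and whole chunks are kept or evicted together, so retained context stays contiguous.

```python
# Chunk-level KV retention sketch (assumed mean-score aggregation per chunk).
import numpy as np

def keep_chunks(keys, values, token_scores, chunk=16, keep_ratio=0.3):
    L = len(token_scores)
    n_chunks = L // chunk
    chunk_scores = token_scores[: n_chunks * chunk].reshape(n_chunks, chunk).mean(1)
    n_keep = max(1, int(n_chunks * keep_ratio))
    kept = np.sort(np.argsort(chunk_scores)[-n_keep:])       # whole chunks survive
    idx = np.concatenate([np.arange(c * chunk, (c + 1) * chunk) for c in kept])
    return keys[idx], values[idx], idx

rng = np.random.default_rng(5)
keys, values = rng.normal(size=(512, 64)), rng.normal(size=(512, 64))
scores = rng.random(512)                                      # stand-in importance
k, v, idx = keep_chunks(keys, values, scores)
print(k.shape)   # ~30% of tokens retained, in contiguous chunks
```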
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling perform robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
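A hedged sketch of the low-rank idea: a key projection matrix is replaced by a truncated-SVD factorization, and the low-rank intermediate is cached instead of the full key (illustrative; LoRC's progressive, layer-wise rank allocation is not modelled, and the residual error depends on how low-rank the real projection is).

```python
# Low-rank KV weight compression sketch via truncated SVD (assumed setup).
import numpy as np

def low_rank_factors(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (d_model, rank)
    B = Vt[:rank]                       # (rank, d_head)
    return A, B

rng = np.random.default_rng(6)
W_k = rng.normal(size=(1024, 128)) / 32          # a key projection matrix
A, B = low_rank_factors(W_k, rank=32)
x = rng.normal(size=(1, 1024))                   # one token's hidden state
cache_entry = x @ A                              # cache a rank-32 vector, not a 128-d key
key = cache_entry @ B                            # reconstruct the key on demand
print(cache_entry.shape, float(np.abs(key - x @ W_k).mean()))   # SVD truncation error
```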
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
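A hedged sketch of query-driven channel pruning: key-cache channels that contribute least to query-key dot products are dropped, with the matching query channels sliced at attention time. The per-channel magnitude score below is an assumed proxy for ThinK's criterion.

```python
# Query-driven key-channel pruning sketch (assumed |Q|*|K| importance score).
import numpy as np

def prune_key_channels(keys, queries, keep_ratio=0.5):
    """Score each channel by query-key magnitude interaction and keep the top ones."""
    scores = np.abs(queries).mean(0) * np.abs(keys).mean(0)   # per-channel importance
    keep = np.sort(np.argsort(scores)[-int(len(scores) * keep_ratio):])
    return keys[:, keep], keep

rng = np.random.default_rng(7)
keys = rng.normal(size=(1000, 128))        # cached keys: (tokens, channels)
queries = rng.normal(size=(32, 128))       # recent queries used for scoring
k_pruned, keep = prune_key_channels(keys, queries)
# at attention time, slice queries to the same channels: queries[:, keep] @ k_pruned.T
print(k_pruned.shape)                      # half of the key channels retained
```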
arXiv Detail & Related papers (2024-07-30T17:59:08Z)