HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
- URL: http://arxiv.org/abs/2601.13684v1
- Date: Tue, 20 Jan 2026 07:35:06 GMT
- Title: HeteroCache: A Dynamic Retrieval Approach to Heterogeneous KV Cache Compression for Long-Context LLM Inference
- Authors: Zhiyuan Shi, Qibo Qiu, Feng Xue, Zhonglin Jiang, Li Yu, Jian Jiang, Xiaofei He, Wenxiao Wang
- Abstract summary: We propose HeteroCache, a training-free dynamic compression framework. We show that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model at a 224K context length.
- Score: 14.17979669446161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The linear memory growth of the KV cache poses a significant bottleneck for LLM inference in long-context tasks. Existing static compression methods often fail to preserve globally important information, principally because they overlook the attention drift phenomenon, where token significance evolves dynamically. Although recent dynamic retrieval approaches attempt to address this issue, they typically suffer from coarse-grained caching strategies and incur high I/O overhead due to frequent data transfers. To overcome these limitations, we propose HeteroCache, a training-free dynamic compression framework. Our method is built on two key insights: attention heads exhibit diverse temporal heterogeneity, and there is significant spatial redundancy among heads within the same layer. Guided by these insights, HeteroCache categorizes heads based on stability and redundancy. Consequently, we apply a fine-grained weighting strategy that allocates larger cache budgets to heads with rapidly shifting attention to capture context changes, thereby addressing the inefficiency of coarse-grained strategies. Furthermore, we employ a hierarchical storage mechanism in which a subset of representative heads monitors attention shifts and triggers an asynchronous, on-demand retrieval of contexts from the CPU, effectively hiding I/O latency. Finally, experiments demonstrate that HeteroCache achieves state-of-the-art performance on multiple long-context benchmarks and accelerates decoding by up to $3\times$ compared to the original model at a 224K context length. Our code will be open-sourced.
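The per-head budget idea from the abstract can be made concrete with a small sketch. The following Python/PyTorch snippet is a hypothetical illustration, not the authors' implementation: it scores each attention head's temporal stability from the overlap of its top-attended positions across two decoding steps, then gives heads whose attention shifts quickly a larger share of the layer's KV-cache budget. All names (head_stability, allocate_budgets, min_per_head) are assumptions for illustration.

```python
# Minimal sketch of stability-driven, per-head KV budget allocation.
# Hypothetical names; not the HeteroCache implementation.
import torch

def head_stability(attn_t0: torch.Tensor, attn_t1: torch.Tensor) -> torch.Tensor:
    """Per-head stability: overlap of the top-k attended positions at two
    decoding steps. attn_*: [num_heads, seq_len] attention weights."""
    k = max(1, attn_t0.shape[-1] // 20)           # keep ~5% of positions
    top0 = attn_t0.topk(k, dim=-1).indices        # [num_heads, k]
    top1 = attn_t1.topk(k, dim=-1).indices
    # Fraction of step-t0 top positions that are still top positions at t1.
    overlap = (top0.unsqueeze(-1) == top1.unsqueeze(-2)).any(-1).float().mean(-1)
    return overlap                                 # 1.0 = perfectly stable head

def allocate_budgets(stability: torch.Tensor, total_budget: int,
                     min_per_head: int = 16) -> torch.Tensor:
    """Give a larger share of the layer's total KV budget to heads whose
    attention drifts quickly, while every head keeps a small floor."""
    num_heads = stability.shape[0]
    shifting = 1.0 - stability                     # higher = attention drifts more
    weights = shifting / shifting.sum().clamp_min(1e-6)
    spare = max(total_budget - min_per_head * num_heads, 0)
    return (weights * spare).long() + min_per_head  # per-head token budgets
```

The sketch ignores the redundancy grouping among heads within a layer and the asynchronous CPU retrieval described above; it only illustrates how a stability signal could drive heterogeneous budget allocation.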
Related papers
- SenCache: Accelerating Diffusion Model Inference via Sensitivity-Aware Caching [75.02865981328509]
Caching reduces computation by reusing previously computed model outputs across timesteps. We propose Sensitivity-Aware Caching (SenCache), a dynamic caching policy that adaptively selects caching timesteps on a per-sample basis. SenCache achieves better visual quality than existing caching methods under similar computational budgets.
arXiv Detail & Related papers (2026-02-27T17:36:09Z) - PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache [61.57938553036056]
We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences.
arXiv Detail & Related papers (2026-01-07T19:51:06Z) - LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation [24.45300622331682]
KV Cache is commonly used to accelerate LLM inference with long contexts. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. We introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams.
arXiv Detail & Related papers (2025-09-11T16:48:24Z) - DiCache: Let Diffusion Model Determine Its Own Cache [62.954717254728166]
DiCache is a training-free adaptive caching strategy for accelerating diffusion models at runtime. An Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator of the caching error in real time. Dynamic Cache Trajectory Alignment approximates the deep-layer feature output from multi-step historical caches.
arXiv Detail & Related papers (2025-08-24T13:30:00Z) - dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z) - AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference [11.73134417321505]
We propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. We show that our method achieves performance comparable to the full cache while retaining only 10% of the visual KV cache.
arXiv Detail & Related papers (2025-03-31T11:13:18Z) - CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences [36.05521425453999]
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. We introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem". CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates a rational cache size for each layer accordingly, and manages memory constraints in a cascading manner.
arXiv Detail & Related papers (2025-03-16T12:49:44Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - A Learning-Based Caching Mechanism for Edge Content Delivery [2.412158290827225]
5G networks and the rise of the Internet of Things (IoT) are increasingly extending content delivery into the network edge.
This shift introduces unique challenges, particularly due to the limited cache storage and the diverse request patterns at the edge.
We introduce HR-Cache, a learning-based caching framework grounded in the principles of Hazard Rate (HR) ordering.
arXiv Detail & Related papers (2024-02-05T08:06:03Z)