Related papers: Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

URL: http://arxiv.org/abs/2603.01426v1
Date: Mon, 02 Mar 2026 04:16:36 GMT
Title: Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics
Authors: Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty,
Abstract summary: We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing.<n>We find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy.<n>We identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival.
Score: 22.98826013817833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.

Related papers

Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.<n>Our approach integrates seamlessly into both the prefill and decoding stages.<n>Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule [54.37983890753086]
We introduce OjaKV, a framework that integrates a strategic hybrid storage policy with online subspace adaptation.<n>OjaKV preserves crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention.<n>It applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis.
arXiv Detail & Related papers (2025-09-25T21:42:27Z)
KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [7.922206020386125]
Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding.<n>We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens.<n>Our method achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods.
arXiv Detail & Related papers (2025-09-05T14:58:24Z)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent parameters sharing.<n>We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z)
Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing [41.792908098945766]
We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration.<n>Krul selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache.<n>It achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage.
arXiv Detail & Related papers (2025-07-10T01:51:17Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values.<n>For Keys, we propose Similarity aware Recontext (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation.<n>For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? [29.510433427184385]
We present a benchmark KVFundaBench to evaluate the effects of KV cache compression across diverse fundamental language models.<n>We propose ShotKV, a novel compression approach that handles prefill and decoding phases while maintaining shot-level semantic coherence.
arXiv Detail & Related papers (2025-02-04T02:23:06Z)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts.<n>ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units.<n>Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z)
EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance [44.14919492126948]
As memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention.<n>We propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios.<n> EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.
arXiv Detail & Related papers (2024-12-11T16:35:13Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective [85.08718140718707]
UNComp is an uncertainty-aware framework that uncovers sparsity patterns that can be used for adaptive compression.<n>By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x.
arXiv Detail & Related papers (2024-10-04T02:32:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.