Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
- URL: http://arxiv.org/abs/2511.04686v1
- Date: Thu, 23 Oct 2025 18:22:00 GMT
- Title: Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
- Authors: Pratik Poudel
- Abstract summary: Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs). This paper examines the interplay between KV cache management strategies and the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs), yet its unbounded growth in stateful multi-turn scenarios presents major challenges. This paper examines the interplay between KV cache management strategies, the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct, and the often-overlooked integrity of positional encodings. Through empirical analysis using a stateful benchmarking framework, we show that LLM generation quality degrades sharply when the accumulated KV cache approaches or exceeds the model's trained context window (e.g., 8192 tokens for Llama 3), a failure mode distinct from GPU memory exhaustion. Common eviction strategies, even high-retention ones (e.g., 99% via AttentionTop), can worsen performance if they disrupt positional coherence. Because LLMs rely on consistent positional signals (e.g., RoPE), compacting a cache by removing non-contiguous tokens can scramble these signals and lead to degenerative outputs. We further show that simple strategies preserving contiguous context blocks (e.g., keeping an initial "gist") can yield more coherent generations than complex or positionally disruptive ones. We advocate for eviction techniques that respect architectural limits, preserve positional structure, and view "cache health" holistically beyond mere size.
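The positional-coherence point in the abstract can be illustrated with a small sketch. It is not the paper's benchmarking framework: the function names, the toy importance scores, and the "re-index to 0..n-1" compaction step are all assumptions made for illustration, and how a real cache treats positions depends on the implementation. The sketch only compares how far the relative offsets between neighbouring kept tokens drift when a cache is compacted after (a) keeping a contiguous initial block and (b) keeping the highest-scoring, non-contiguous tokens.

```python
# Illustrative sketch only: names and scores are hypothetical, not from the paper.

def keep_gist(positions, budget):
    """Keep the first `budget` positions as one contiguous block."""
    return positions[:budget]


def keep_scattered(positions, scores, budget):
    """Keep the `budget` highest-scoring positions, returned in original order."""
    top = sorted(range(len(positions)), key=lambda i: scores[i], reverse=True)[:budget]
    return [positions[i] for i in sorted(top)]


def compact(kept):
    """Assumed naive compaction: kept entries are re-indexed to 0..len(kept)-1."""
    return list(range(len(kept)))


def offset_drift(kept):
    """Largest change in the gap between neighbouring kept tokens after compaction."""
    new = compact(kept)
    return max(
        (abs((kept[i + 1] - kept[i]) - (new[i + 1] - new[i])) for i in range(len(kept) - 1)),
        default=0,
    )


positions = list(range(12))                        # original token positions 0..11
scores = [9, 1, 8, 1, 1, 7, 1, 1, 1, 6, 1, 5]      # made-up importance scores
budget = 5

gist = keep_gist(positions, budget)
scattered = keep_scattered(positions, scores, budget)

print(gist, offset_drift(gist))            # [0, 1, 2, 3, 4] 0
print(scattered, offset_drift(scattered))  # [0, 2, 5, 9, 11] 3
```

In this toy setting the contiguous block keeps every neighbouring gap at 1, while the scattered selection turns gaps of 2, 3, and 4 into 1 once re-indexed. Position-dependent schemes such as RoPE are sensitive to exactly this kind of change in relative offsets.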
Related papers
- From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents [78.30630000529133]
We propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory.
MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic.
Experiments confirm the effectiveness of MM-Mem on both offline and streaming tasks.
arXiv Detail & Related papers (2026-03-02T05:12:45Z)
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics [22.98826013817833]
We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing.
We find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy.
We identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival.
arXiv Detail & Related papers (2026-03-02T04:16:36Z)
- Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models [8.944739362562494]
Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens.
We propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs.
HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds.
arXiv Detail & Related papers (2026-02-02T15:01:44Z)
- HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference [11.718567830546538]
Long-context inference in Large Language Models is bottlenecked by the quadratic computational complexity of attention.
We introduce HyLRA, a novel framework driven by layer-wise sparsity profiling.
We show that HyLRA improves inference throughput by 6% to 46% while maintaining comparable performance.
arXiv Detail & Related papers (2026-01-31T15:36:17Z)
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.
Our approach integrates seamlessly into both the prefill and decoding stages.
Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
- Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs [26.951325519894525]
We propose a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate.
We show that it consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes.
It even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization.
arXiv Detail & Related papers (2025-12-03T00:20:35Z)
- KVCompose: Efficient Structured KV Cache Compression with Composite Tokens [7.922206020386125]
Large language models (LLMs) rely on key-value (KV) caches for efficient autoregressive decoding.
We propose a simple, yet effective, KV cache compression framework based on attention-guided, layer-adaptive composite tokens.
Our method achieves significant memory reduction while preserving accuracy, consistently outperforming prior structured and semi-structured methods.
arXiv Detail & Related papers (2025-09-05T14:58:24Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [72.27673320976933]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding.
Current caching techniques accelerate decoding by storing full-layer states, yet impose substantial memory usage.
We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.
We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.
We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.
In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency.
We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- In-context KV-Cache Eviction for LLMs via Attention-Gate [12.732519329131392]
The KV-Cache technique has become the standard for the inference of large language models (LLMs).
This paper enables a novel dynamic KV-Cache eviction policy by injecting a lightweight module called Attention-Gate into the model.
We empirically evaluate the proposed approach across multiple scenarios, showing that effective eviction of redundant tokens can not only improve efficiency but also enhance performance.
arXiv Detail & Related papers (2024-10-15T05:01:19Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.