Taming the Fragility of KV Cache Eviction in LLM Inference
- URL: http://arxiv.org/abs/2510.13334v1
- Date: Wed, 15 Oct 2025 09:18:58 GMT
- Title: Taming the Fragility of KV Cache Eviction in LLM Inference
- Authors: Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, Xike Xie
- Abstract summary: We propose a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size.
- Score: 36.547639886708026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer's Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption: that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation due to a faithful trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at https://github.com/FFY0/DefensiveKV.
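The contrast between mean aggregation and a worst-case-aware alternative can be sketched in a few lines. This is a minimal illustration, not the paper's actual algorithm: the importance scores are synthetic, and the worst-case aggregator is shown as a simple per-entry minimum, a stand-in for whatever risk-controlled aggregator DefensiveKV actually uses.

```python
import numpy as np

def evict(scores_over_steps, keep_ratio=0.2, aggregate="mean"):
    """Select which KV cache entries to keep under a scoring-aggregation scheme.

    scores_over_steps: array of shape (T, N) -- per-step importance scores
    for N cache entries across T generation steps (synthetic here).
    Returns the sorted indices of the retained entries.
    """
    if aggregate == "mean":
        # Mean aggregation trusts the stability assumption: an entry that is
        # important on average is assumed to stay important.
        agg = scores_over_steps.mean(axis=0)
    else:
        # Worst-case-aware stand-in: penalize any entry that ever scores low,
        # bounding the risk of evicting an entry that later becomes critical.
        agg = scores_over_steps.min(axis=0)
    n_keep = max(1, int(keep_ratio * scores_over_steps.shape[1]))
    return np.sort(np.argsort(agg)[-n_keep:])
```

Under mean aggregation, an entry scoring [0.9, 0.9, 0.0] across steps outranks a steady [0.5, 0.5, 0.5] entry; the worst-case aggregator reverses that ordering, which is exactly the fragile extreme case the abstract describes.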
Related papers
- CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z) - FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z) - From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching [7.164841206695704]
We present the first systematic study of integrity risks arising from cache collisions. We introduce CacheAttack, an automated framework for launching black-box collision attacks. A case study on a financial agent illustrates the real-world impact of these vulnerabilities.
arXiv Detail & Related papers (2026-01-30T15:37:00Z) - Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity [0.0]
The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs). This paper examines the interplay between KV cache management strategies and the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct.
arXiv Detail & Related papers (2025-10-23T18:22:00Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
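The last-window scoring heuristic that Judge Q sets out to improve can be sketched as follows. The shapes and scaling are illustrative assumptions (a single head, pre-softmax scaling by the square root of the head dimension); this is not Judge Q itself, which replaces the fixed last window with a trained soft token list.

```python
import numpy as np

def last_window_importance(queries, keys, window=8):
    """Score each cached key by the attention mass it receives from the
    last `window` queries of the pre-filling phase (single-head sketch).

    queries: (T, d) query vectors; keys: (N, d) cached key vectors.
    Returns an (N,) array of accumulated importance scores.
    """
    q = queries[-window:]                          # observation window
    logits = q @ keys.T / np.sqrt(keys.shape[1])   # scaled dot-product
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over keys
    return attn.sum(axis=0)                        # mass per cache entry
```

Entries with the lowest accumulated mass would then be the eviction candidates; the known weakness of this heuristic is that a small fixed window may not represent the queries issued later in decoding.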
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference [17.46930265810127]
The Key-Value (KV) cache stores intermediate attention computations (Key and Value pairs) to avoid redundant calculations. This paper provides the first comprehensive analysis of KV-cache vulnerabilities, demonstrating that an attacker can reconstruct sensitive user inputs directly from the KV-cache. We propose KV-Cloak, a novel, lightweight, and efficient defense mechanism.
arXiv Detail & Related papers (2025-08-13T02:48:25Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - Improving Black-Box Generative Attacks via Generator Semantic Consistency [51.470649503929344]
Generative attacks produce adversarial examples in a single forward pass at test time. We enforce semantic consistency by aligning the early generator's intermediate features to an EMA teacher. Our approach can be seamlessly integrated into existing generative attacks, with consistent improvements in black-box transfer.
arXiv Detail & Related papers (2025-06-23T02:35:09Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate cache importance.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [19.447729423696096]
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache. Recent efforts aim to reduce KV cache size by evicting large numbers of non-critical cache elements during runtime. We propose Ada-KV, the first head-wise adaptive budget allocation strategy.
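Head-wise adaptive allocation in the spirit of Ada-KV can be sketched by pooling importance scores across heads and letting the globally top-scoring entries determine each head's budget, instead of giving every head the same uniform share. The pooling rule below is an assumption for illustration, not Ada-KV's published allocation formula.

```python
import numpy as np

def adaptive_budgets(head_scores, total_budget):
    """Split a total cache budget across attention heads (illustrative rule).

    head_scores: list of 1-D arrays, one per head, giving the importance
    score of each cached entry in that head.
    Returns per-head budgets summing to total_budget; heads holding more
    high-importance entries receive larger budgets.
    """
    flat = np.concatenate(head_scores)
    owner = np.concatenate(
        [np.full(len(s), h) for h, s in enumerate(head_scores)]
    )
    top = np.argsort(flat)[-total_budget:]   # globally top-scoring entries
    return np.bincount(owner[top], minlength=len(head_scores)).tolist()
```

A head whose entries all score highly can thus absorb budget from a head whose entries are uniformly unimportant, which is the intuition behind moving from uniform to adaptive allocation.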
arXiv Detail & Related papers (2024-07-16T09:53:32Z) - On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference [40.789027180025286]
Large Language Models (LLMs) are notably cost-prohibitive to deploy in resource-constrained environments.
We introduce RoCo, a robust cache omission policy based on temporal attention scores and robustness measures.
We release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference.
arXiv Detail & Related papers (2024-02-09T09:20:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.