Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models
- URL: http://arxiv.org/abs/2602.02197v1
- Date: Mon, 02 Feb 2026 15:01:44 GMT
- Title: Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models
- Authors: Xindian Ma, Yidi Lu, Peng Zhang, Jing Zhang
- Abstract summary: Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens. We propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds.
- Score: 8.944739362562494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The integration of visual information into Large Language Models (LLMs) has enabled Multimodal LLMs (MLLMs), but the quadratic memory and computational costs of Transformer architectures remain a bottleneck. Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens, leading to suboptimal efficiency or degraded performance. In this paper, we propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs by implementing Dual-Attention Pruning during pre-filling (leveraging visual token sparsity and attention variance) and a Dynamic Decoding Eviction Strategy (inspired by OS Recycle Bins) during decoding. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds compared to greedy strategies, enhancing efficiency in both comprehension and generation tasks. Empirically, HAE reduces KV cache memory by 41% with minimal accuracy loss (a 0.3% drop) on image understanding tasks and accelerates story-generation inference by 1.5x while maintaining output quality on the Phi3.5-Vision-Instruct model.
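To make the two stages concrete, here is a minimal sketch of what the abstract describes: attention-guided pruning of visual tokens at prefill (Dual-Attention Pruning) and a recycle-bin buffer during decoding. The scoring rule, thresholds, and all names below are illustrative assumptions, not the authors' implementation.

```python
import torch

def dual_attention_prune(attn, is_visual, keep_ratio=0.5):
    """Prefill-stage sketch: score each cached position by the attention
    mass it receives plus its attention variance, then keep only the top
    visual tokens. Text tokens are kept unconditionally in this sketch.
    attn: [heads, q_len, k_len] attention weights from one layer."""
    mean_score = attn.mean(dim=(0, 1))            # [k_len] average attention received
    var_score = attn.var(dim=(0, 1))              # [k_len] attention variance
    score = mean_score + var_score                # assumed combination rule
    score = score.masked_fill(~is_visual, float("inf"))   # never prune text tokens
    k = int(is_visual.sum().item() * keep_ratio) + int((~is_visual).sum().item())
    return score.topk(k).indices.sort().values    # KV indices to retain

class RecycleBin:
    """Decoding-stage sketch: evicted KV pairs go to a fixed-size bin and
    can be restored if a later query still looks relevant to them."""
    def __init__(self, capacity=64):
        self.capacity, self.slots = capacity, []  # list of (index, key, value)

    def evict(self, idx, k, v):
        self.slots.append((idx, k, v))
        if len(self.slots) > self.capacity:       # oldest entry is dropped for good
            self.slots.pop(0)

    def restore_if_needed(self, query, threshold=0.5):
        kept, restored = [], []
        for idx, k, v in self.slots:
            if torch.cosine_similarity(query, k, dim=0) > threshold:  # assumed test
                restored.append((idx, k, v))      # hand back to the live cache
            else:
                kept.append((idx, k, v))
        self.slots = kept
        return restored
```

Note the asymmetry the sketch preserves: text tokens are never pruned at prefill, while entries evicted during decoding get a second chance through the bin before being dropped for good.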
Related papers
- ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language-modeling loss increase on low-entropy tokens.
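A heavily simplified sketch of learning long-term contribution: a tiny scorer predicts each KV pair's future usefulness from its key vector and accumulated attention, and eviction ranks by that prediction. The GRPO training loop and MDP machinery from the paper are omitted; every shape and name here is a placeholder.

```python
import torch
import torch.nn as nn

class ContributionScorer(nn.Module):
    """Placeholder predictor of a KV pair's long-term contribution, scored
    from its key vector plus one scalar of accumulated attention."""
    def __init__(self, head_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(head_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, keys, cum_attn):            # keys: [seq, d], cum_attn: [seq]
        feats = torch.cat([keys, cum_attn.unsqueeze(-1)], dim=-1)
        return self.net(feats).squeeze(-1)        # [seq] predicted contribution

def evict_lowest(keys, values, cum_attn, scorer, n_evict):
    scores = scorer(keys, cum_attn)
    keep = scores.topk(keys.shape[0] - n_evict).indices.sort().values
    return keys[keep], values[keep]
```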
arXiv Detail & Related papers (2026-02-03T07:16:51Z) - Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
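A minimal sketch of gating-based eviction under these constraints: the backbone stays frozen and only a small gate over cached keys is trained, with eviction capped at a target fraction. All modules and thresholds below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class KVGate(nn.Module):
    """Scores each cached key with a sigmoid gate; the backbone LLM stays
    frozen and only this projection would be trained."""
    def __init__(self, head_dim=128):
        super().__init__()
        self.proj = nn.Linear(head_dim, 1)

    def forward(self, keys):                      # keys: [seq, head_dim]
        return torch.sigmoid(self.proj(keys)).squeeze(-1)

def gated_evict(keys, values, gate, max_evict_frac=0.7, threshold=0.5):
    g = gate(keys)
    budget = int(max_evict_frac * keys.shape[0])  # cap eviction at the 70% mark
    order = g.argsort()                           # ascending gate score
    candidates = order[:budget]
    drop = candidates[g[candidates] < threshold]  # only low-gate entries leave
    keep_mask = torch.ones(keys.shape[0], dtype=torch.bool)
    keep_mask[drop] = False
    return keys[keep_mask], values[keep_mask]
```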
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity [0.0]
The Key-Value (KV) cache is integral to efficient autoregressive inference in large language models (LLMs). This paper examines the interplay between KV cache management strategies and the architectural context limits of models like meta-llama/Meta-Llama-3-8b-instruct.
arXiv Detail & Related papers (2025-10-23T18:22:00Z) - KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache [0.9238700679836854]
Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference. We present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context. Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
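The chunked, RNN-gated retention described above might look roughly like this: the KV sequence is cut into fixed-size chunks, a lightweight GRU reads one summary vector per chunk, and a sigmoid gate decides which chunks stay cached. Chunk size, the summary rule, and module shapes are hypothetical.

```python
import torch
import torch.nn as nn

class ChunkGate(nn.Module):
    """Reads one mean-pooled summary per KV chunk with a small GRU and
    emits a keep/evict decision per chunk."""
    def __init__(self, head_dim=128, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(head_dim, hidden, batch_first=True)
        self.gate = nn.Linear(hidden, 1)

    def forward(self, keys, chunk=16):            # keys: [seq, head_dim]
        n = keys.shape[0] // chunk
        summaries = keys[: n * chunk].reshape(n, chunk, -1).mean(dim=1)
        h, _ = self.rnn(summaries.unsqueeze(0))   # [1, n, hidden]
        return torch.sigmoid(self.gate(h[0])).squeeze(-1) > 0.5  # [n] keep mask

keys = torch.randn(128, 128)
keep_chunks = ChunkGate()(keys)                   # e.g. tensor([True, False, ...])
```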
arXiv Detail & Related papers (2025-09-20T02:04:24Z) - Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically use the last window from the pre-filling phase as queries to compute KV importance scores for eviction. We propose Judge Q, a novel training method that incorporates a soft token list.
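A small sketch of the contrast the abstract draws: importance scores come from a trainable list of soft queries rather than the last prefill window. The soft tokens would be trained as in the paper; here they are randomly initialized placeholders.

```python
import torch

def importance_from_soft_queries(keys, soft_queries):
    """keys: [seq, d]; soft_queries: [m, d] trained vectors standing in
    for the last-window queries used by prior eviction methods."""
    attn = torch.softmax(soft_queries @ keys.T / keys.shape[-1] ** 0.5, dim=-1)
    return attn.mean(dim=0)                       # [seq] averaged importance

keys = torch.randn(256, 128)
soft_q = torch.nn.Parameter(torch.randn(8, 128)) # learned offline in practice
scores = importance_from_soft_queries(keys, soft_q)
keep = scores.topk(128).indices.sort().values    # retain half the cache
```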
arXiv Detail & Related papers (2025-09-13T03:34:12Z) - ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs [15.76582272387931]
We propose ZSMerge, a dynamic KV cache compression framework for efficient cache management. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation.
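One way to read "merge rather than discard" in code: fold each dropped KV pair into its most similar retained entry by a running average, with no training involved. ZSMerge's actual scoring and merging rules may differ; this only illustrates the zero-shot flavor.

```python
import torch

def merge_evicted(keys, values, keep_idx, drop_idx):
    """Fold each dropped KV pair into its most similar retained entry by a
    running average instead of discarding it outright."""
    k_keep, v_keep = keys[keep_idx].clone(), values[keep_idx].clone()
    counts = torch.ones(len(keep_idx))
    for i in drop_idx.tolist():
        sims = torch.cosine_similarity(keys[i].unsqueeze(0), k_keep, dim=-1)
        j = sims.argmax()
        counts[j] += 1                            # running average keeps unit scale
        k_keep[j] += (keys[i] - k_keep[j]) / counts[j]
        v_keep[j] += (values[i] - v_keep[j]) / counts[j]
    return k_keep, v_keep
```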
arXiv Detail & Related papers (2025-03-13T03:36:03Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with prior approaches.
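A sketch of the "adaptive prefix" idea: rank cache entries by importance and keep a per-layer prefix whose length adapts to how concentrated that layer's attention is. The entropy-based budget rule below is an assumption, not the paper's formula.

```python
import torch

def prefix_keep(attn, base_budget=0.3):
    """attn: [heads, q_len, k_len] from one layer. Returns indices of the
    top-importance entries; the budget grows when this layer's attention is
    spread out (high entropy) and shrinks when it is peaked."""
    score = attn.sum(dim=(0, 1))                  # [k_len] importance
    p = score / score.sum()
    entropy = -(p * (p + 1e-9).log()).sum() / torch.log(torch.tensor(float(len(p))))
    budget = max(1, int(len(p) * base_budget * (0.5 + entropy)))  # assumed rule
    return score.topk(min(budget, len(p))).indices.sort().values
```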
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
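The plug-in, no-retraining property follows from factoring pretrained projection weights rather than touching the model's activations. A minimal sketch with truncated SVD (the paper's progressive, per-layer rank selection is not shown):

```python
import torch

def low_rank_factor(W, r):
    """Factor a pretrained projection W [d_model, d_head] into A [d_model, r]
    and B [r, d_head] via truncated SVD, no retraining required."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :r] * S[:r], Vh[:r]               # singular values absorbed into A

W_k = torch.randn(4096, 128)                      # a stand-in key projection
A, B = low_rank_factor(W_k, r=32)
rel_err = torch.linalg.norm(W_k - A @ B) / torch.linalg.norm(W_k)
# cache the rank-32 projections x @ A instead of x @ W_k,
# and reconstruct keys on the fly as (x @ A) @ B
```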
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
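Query-driven channel pruning, sketched: score each key-cache channel by the query-key interaction magnitude it contributes, then slice away the weakest channels. The exact criterion in ThinK may differ; the scoring below is a stand-in.

```python
import torch

def prune_key_channels(queries, keys, keep_frac=0.6):
    """queries: [q_len, d]; keys: [k_len, d]. Scores each channel by the
    magnitude of query-key interaction it contributes, keeps the top ones."""
    channel_score = queries.abs().mean(dim=0) * keys.abs().mean(dim=0)   # [d]
    r = int(keep_frac * keys.shape[-1])
    keep = channel_score.topk(r).indices.sort().values
    return keys[:, keep], keep     # queries must be sliced with the same indices
```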
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [28.244034916473804]
Generative inference in Large Language Models (LLMs) is impeded by the growing memory demands of the Key-Value (KV) cache. Traditional KV cache eviction strategies discard less critical KV pairs based on attention scores, leading to issues such as context loss or hallucinations. We introduce Dynamic Discriminative Operations (D2O), a KV cache compression method that optimizes KV cache size dynamically and discriminatively at two levels without fine-tuning.
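A sketch of the two levels named above: a layer-level budget set by how flat that layer's attention is, and token-level selection that merges a discarded token into a similar kept one instead of losing it outright. Both heuristics are assumptions, not D2O's exact rules.

```python
import torch

def layer_budget(attn, lo=0.2, hi=0.8):
    """Layer level: flatter attention (high entropy) earns a larger cache."""
    p = attn.mean(dim=(0, 1))
    p = p / p.sum()
    ent = -(p * (p + 1e-9).log()).sum() / torch.log(torch.tensor(float(len(p))))
    return lo + (hi - lo) * ent.item()

def select_and_merge(keys, values, attn, frac):
    """Token level: keep the strongest entries, merge the rest into their
    nearest kept neighbor instead of discarding them."""
    score = attn.sum(dim=(0, 1))
    order = score.argsort(descending=True)
    n_keep = max(1, int(frac * len(score)))
    keep, drop = order[:n_keep], order[n_keep:]
    k, v = keys[keep].clone(), values[keep].clone()
    for i in drop.tolist():
        j = torch.cosine_similarity(keys[i].unsqueeze(0), k, dim=-1).argmax()
        k[j] = 0.5 * (k[j] + keys[i])
        v[j] = 0.5 * (v[j] + values[i])
    return k, v
```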
arXiv Detail & Related papers (2024-06-18T20:01:51Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
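The retention policy the abstract hints at can be sketched as recency-driven: keep a key if any query in a recent window attended to it above a threshold, evict it otherwise, and never touch the window itself. Window size and threshold are illustrative.

```python
import torch

def corm_style_keep(attn, recent=32, threshold=0.02):
    """attn: [heads, q_len, k_len]. Keep any key that some query in the
    recent window attended to above the threshold; the window itself is
    always retained."""
    hit = (attn[:, -recent:, :] > threshold).flatten(0, 1).any(dim=0)  # [k_len]
    hit[-recent:] = True
    return hit.nonzero(as_tuple=True)[0]          # indices of retained KV pairs
```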
arXiv Detail & Related papers (2024-04-24T16:11:54Z)