Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
- URL: http://arxiv.org/abs/2510.20707v1
- Date: Thu, 23 Oct 2025 16:17:47 GMT
- Title: Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
- Authors: Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang
- Abstract summary: MixKV is a novel method that mixes importance with diversity for optimized KV cache compression in vision-language models. Under extreme compression, MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks.
- Score: 14.603288559638614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for optimized KV cache compression in LVLMs. MixKV adapts to head-wise semantic redundancy, selectively balancing diversity and importance when compressing KV pairs. Extensive experiments demonstrate that MixKV consistently enhances existing methods across multiple LVLMs. Under extreme compression (budget=64), MixKV improves baseline methods by an average of 5.1% across five multi-modal understanding benchmarks and achieves remarkable gains of 8.0% and 9.0% for SnapKV and AdaKV on GUI grounding tasks, all while maintaining comparable inference efficiency. Furthermore, MixKV extends seamlessly to LLMs with comparable performance gains. Our code is available at https://github.com/xuyang-liu16/MixKV.
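The abstract outlines the idea (importance-only selection misses semantic coverage, so diversity is mixed in per attention head) but not the mechanics. Below is a minimal, hypothetical sketch of per-head importance-plus-diversity KV selection in PyTorch: the greedy farthest-point diversity step, the fixed `alpha` mixing knob, and the function name are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Hypothetical sketch of importance + diversity KV selection for one head.
# Not the official MixKV implementation -- see https://github.com/xuyang-liu16/MixKV.
import torch


def select_kv(keys, importance, budget, alpha=0.5):
    """Pick `budget` KV positions by mixing importance with diversity.

    keys:       (seq_len, head_dim) key vectors for one attention head
    importance: (seq_len,) importance scores (e.g. accumulated attention weights)
    alpha:      fraction of the budget spent on top-importance positions
                (assumed knob; the paper adapts the balance per head)
    """
    seq_len = keys.size(0)
    if budget >= seq_len:
        return torch.arange(seq_len)

    n_imp = max(1, int(alpha * budget))
    keep = set(torch.topk(importance, n_imp).indices.tolist())

    # Greedy max-min (farthest-point) selection for the remaining slots:
    # repeatedly add the position whose key is farthest from everything kept,
    # so the retained cache covers diverse semantic directions.
    normed = torch.nn.functional.normalize(keys, dim=-1)
    while len(keep) < budget:
        kept = normed[list(keep)]                   # (k, head_dim)
        sim = normed @ kept.T                       # (seq_len, k)
        min_dist = 1.0 - sim.max(dim=-1).values     # distance to nearest kept key
        min_dist[list(keep)] = -1.0                 # never re-pick kept positions
        keep.add(min_dist.argmax().item())

    return torch.tensor(sorted(keep))


# Toy usage: compress a 1024-token cache for one head down to 64 entries.
keys = torch.randn(1024, 128)
importance = torch.rand(1024)
kept_idx = select_kv(keys, importance, budget=64)
print(kept_idx.shape)  # torch.Size([64])
```

In the paper's framing, the share of the budget given to diversity would itself adapt to the measured redundancy of each head rather than being a fixed constant as in this toy example.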
Related papers
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z) - PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression [8.427136461713706]
We present PackKV, a generic and efficient KV cache management framework. PackKV supports both latency-critical and throughput-critical inference scenarios.
arXiv Detail & Related papers (2025-12-30T20:05:32Z) - Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach [9.778764951947016]
Multimodal large language models suffer from substantial inference overhead since the KV cache grows proportionally with input length. Existing multimodal KV cache compression methods mostly rely on attention scores to reduce cache size. We propose FlashCache, a frequency-domain-guided, outlier-KV-aware KV cache compression framework.
arXiv Detail & Related papers (2025-11-20T20:25:34Z) - CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent-layer parameter sharing. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z) - KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache [13.662270631753135]
Quantization can effectively alleviate the memory pressure caused by the KV cache. We propose KVmix, a novel mixed-precision quantization method for the KV cache.
arXiv Detail & Related papers (2025-05-18T07:04:53Z) - KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference [16.53643930310808]
KeepKV is a novel adaptive KV cache merging method designed to eliminate output perturbation while preserving performance under strict memory constraints. We show that KeepKV substantially reduces memory usage, improves inference throughput by more than 2x, and maintains superior generation quality even with 10% KV cache budgets.
arXiv Detail & Related papers (2025-04-14T06:58:00Z) - KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse [17.301289617498448]
We present a KV cache management module that shares the KV cache across requests in multi-tenant scenarios. KVShare reduces TTFT by up to 9.39x and increases throughput by 1.2x compared to full KV recomputation. KVShare achieves a 20.38% accuracy boost compared to SOTA methods.
arXiv Detail & Related papers (2025-03-17T16:43:35Z) - DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts. ChunkKV reimagines KV cache compression by treating semantic chunks as the basic compression units. As a result, ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [33.936381781692994]
DiffKV is a novel framework for efficient KV cache compression. It exploits three levels of differentiation in the KV cache. It is able to compress the KV cache by 2.7x to 5.7x with near-lossless accuracy.
arXiv Detail & Related papers (2024-12-04T08:51:23Z) - KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows rapidly with sequence length.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages (a minimal low-rank factorization sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
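The LoRC entry above describes a low-rank approximation of the KV projection weights that plugs into an existing model without retraining. As a rough illustration only (the paper's progressive, layer-aware compression strategy is not reproduced here, and the shapes, rank, and names below are assumptions for the example), a truncated SVD of a key projection might look like this:

```python
# Hypothetical sketch of low-rank KV weight factorization in the spirit of LoRC.
# Rank choice and the paper's progressive per-layer strategy are not reproduced.
import torch


def low_rank_factor(weight, rank):
    """Split a (d_model, d_head) projection into A @ B with inner dimension `rank`."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_model, rank): left factor scaled by singular values
    B = Vh[:rank]                # (rank, d_head): right factor
    return A, B


# Toy usage: approximate a key projection and check the reconstruction error.
W_k = torch.randn(4096, 128)
A, B = low_rank_factor(W_k, rank=32)
err = torch.norm(W_k - A @ B) / torch.norm(W_k)
print(f"relative reconstruction error at rank 32: {err.item():.3f}")
```

With W_k approximated as A @ B, a cache can in principle store the rank-dimensional activations x @ A per token and apply B on read; when the rank is smaller than the head dimension, that is the kind of storage reduction such a factorization enables.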
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.