MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
- URL: http://arxiv.org/abs/2506.15724v1
- Date: Fri, 06 Jun 2025 01:51:24 GMT
- Title: MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference
- Authors: Kunxi Li, Zhonghua Jiang, Zhouzhou Shen, Zhaode Wang, Chengfei Lv, Shengyu Zhang, Fan Wu, Fei Wu
- Abstract summary: MadaKV is a modality-adaptive key-value cache eviction strategy for long-context inference. It achieves substantial reductions in KV cache memory footprint and inference decoding latency. Experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV.
- Score: 13.069489189643441
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3 to 1.5 times improvement) while maintaining high accuracy across various multimodal long-context tasks. Extensive experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV compared to existing KV cache eviction methods.
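To make the abstract's idea concrete, below is a minimal, hypothetical sketch of per-head modality-adaptive KV cache eviction: each head's retention budget is split between vision and text tokens in proportion to how much attention mass that head places on each modality, and only the top-scoring tokens of each modality are kept. The function name, tensor layout, and the proportional budget-splitting heuristic are illustrative assumptions, not the paper's actual modality preference adaptation or hierarchical compression compensation algorithms.

```python
# Hypothetical sketch of modality-adaptive KV cache eviction (NOT the authors'
# implementation). `modality_mask`, the per-head budget split, and the tensor
# layout are assumptions made for illustration only.
import torch


def evict_kv_per_head(keys, values, attn_scores, modality_mask, keep_ratio=0.5):
    """Keep the most-attended tokens per head, splitting each head's budget
    between vision and text according to that head's modality preference.

    keys, values:  [num_heads, seq_len, head_dim]
    attn_scores:   [num_heads, seq_len]  accumulated attention each token received
    modality_mask: [seq_len]             True for vision tokens, False for text
    """
    num_heads, seq_len, _ = keys.shape
    budget = max(1, int(seq_len * keep_ratio))
    kept_keys, kept_values = [], []

    for h in range(num_heads):
        scores = attn_scores[h]
        # Modality preference of this head: share of attention mass on vision tokens.
        vision_pref = scores[modality_mask].sum() / (scores.sum() + 1e-6)
        vision_budget = int(budget * vision_pref.item())
        text_budget = budget - vision_budget

        vision_idx = torch.nonzero(modality_mask, as_tuple=True)[0]
        text_idx = torch.nonzero(~modality_mask, as_tuple=True)[0]

        keep = []
        if vision_budget > 0 and len(vision_idx) > 0:
            top = scores[vision_idx].topk(min(vision_budget, len(vision_idx))).indices
            keep.append(vision_idx[top])
        if text_budget > 0 and len(text_idx) > 0:
            top = scores[text_idx].topk(min(text_budget, len(text_idx))).indices
            keep.append(text_idx[top])
        keep = torch.sort(torch.cat(keep)).values

        kept_keys.append(keys[h, keep])
        kept_values.append(values[h, keep])

    # Ragged per-head lists after eviction; a real cache would repack these.
    return kept_keys, kept_values
```

The per-head budget split is the point of the sketch: as the abstract notes, different attention heads prefer different modalities, so a single uniform retention rule would over-evict exactly the modality a given head relies on.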
Related papers
- SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small-model-assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75-2.56 times higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z)
- IAM: Efficient Inference through Attention Mapping between Different-scale LLMs [74.81417160018856]
The IAM framework achieves the dual benefits of accelerated attention computation and reduced KV cache usage. We show that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance.
arXiv Detail & Related papers (2025-07-16T06:39:11Z)
- TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering [10.427881558469442]
We introduce TaDA, a training-free recipe for KV cache compression with adaptive quantization precision. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths, and paves the way for scalable and high-performance reasoning in language models.
arXiv Detail & Related papers (2025-06-05T05:23:38Z)
- MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference [15.895020720304656]
MEDA is a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed.
arXiv Detail & Related papers (2025-02-24T19:34:52Z)
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- Multi-matrix Factorization Attention [59.10039136733939]
We propose Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads. MFA-KR further reduces memory requirements by repurposing the key cache as the value cache.
arXiv Detail & Related papers (2024-12-26T15:45:45Z)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension in isolation. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression (see the quantization sketch after this list).
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
The key-value (KV) cache, necessitated by lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into a search for the optimal global prefix configuration. Our method achieves state-of-the-art performance compared with existing approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference [32.20654044142376]
LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size.
It achieves up to 1.5x faster decoding and maintains or even enhances performance across a variety of long-context multimodal tasks.
arXiv Detail & Related papers (2024-06-26T07:44:24Z)
- D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models [28.244034916473804]
Generative inference in Large Language Models (LLMs) is impeded by the growing memory demands of the Key-Value (KV) cache. Traditional KV cache eviction strategies discard less critical KV pairs based on attention scores, leading to issues such as context loss or hallucinations. We introduce Dynamic Discriminative Operations (D2O), a KV cache compression method that optimizes KV cache size dynamically and discriminatively at two levels without fine-tuning.
arXiv Detail & Related papers (2024-06-18T20:01:51Z)
- QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z)
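Several entries above (the token-precision trade-off paper and QAQ) compress the KV cache along the precision dimension rather than by dropping tokens. The sketch below shows a generic symmetric per-token int8 quantization of KV tensors; it is not QAQ's quality-adaptive scheme, and the function names and fixed bit width are assumptions made only to illustrate the trade-off.

```python
# Generic per-token KV cache quantization sketch (an assumption for illustration,
# NOT the QAQ algorithm or any specific paper's scheme).
import torch


def quantize_kv(x, n_bits=8):
    """Symmetric per-token quantization of a KV tensor of shape [seq_len, head_dim]."""
    qmax = 2 ** (n_bits - 1) - 1
    # One scale per token, derived from that token's largest absolute value.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6) / qmax
    q = torch.round(x / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale  # cache stores int8 codes plus one scale per token


def dequantize_kv(q, scale):
    return q.float() * scale


# Usage: holding (codes, scales) instead of fp16 tensors roughly halves KV memory
# at 8 bits (more at lower bit widths, minus the per-token scale overhead).
k = torch.randn(1024, 128)
codes, scales = quantize_kv(k)
reconstruction_error = (dequantize_kv(codes, scales) - k).abs().mean()
```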
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.