FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
- URL: http://arxiv.org/abs/2511.05534v2
- Date: Thu, 13 Nov 2025 14:25:08 GMT
- Title: FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
- Authors: Kunxi Li, Yufan Xiong, Zhonghua Jiang, Yiyun Zhou, Zhaode Wang, Chengfei Lv, Shengyu Zhang,
- Abstract summary: FlowMM is an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. We show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.
- Score: 14.674840098804184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional KV cache eviction strategies, which discard less critical KV-pairs based on attention scores, often degrade generation quality, causing context loss or hallucinations. Recent efforts shift toward KV merging, merging eviction tokens with retention tokens based on similarity. However, in multimodal scenarios, distributional biases across modality tokens and attentional biases in cross-modal interactions limit its effectiveness. This work introduces FlowMM, an adaptive framework for cross-modal information flow-guided multimodal KV cache merging. FlowMM leverages cross-modal information flow to dynamically apply layer-specific merging strategies, capturing modality-specific patterns while preserving contextual integrity. Furthermore, we introduce a sensitivity-adaptive token matching mechanism that jointly evaluates token similarity and task-critical sensitivity, merging low-risk tokens while safeguarding high-sensitivity ones. Extensive experiments across diverse leading MLLMs show that FlowMM reduces KV cache memory by 80% to 95% and decoding latency by 1.3-1.8x, while maintaining competitive task performance.
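The abstract describes merging eviction candidates into similar retained tokens while protecting high-sensitivity ones. The following is a minimal sketch of that general idea, not the authors' algorithm: the function name `merge_kv`, the cosine-similarity matching on keys, the mean-based merge, both thresholds, and the per-token `sensitivity` scores are all assumptions made for illustration, and the layer-specific, information-flow-guided strategy selection is not modelled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_kv(keys, values, evict, sensitivity,
             sim_threshold=0.9, sens_threshold=0.5):
    """Hypothetical similarity-plus-sensitivity KV merge (illustrative only).

    keys, values: one vector (list of floats) per cached token.
    evict: indices proposed for eviction by some upstream policy.
    sensitivity: assumed per-token risk score; higher means more task-critical.
    Returns (new_keys, new_values, kept_indices).
    """
    evict = set(evict)
    # High-sensitivity candidates are protected: they stay in the cache unmerged.
    protected = {i for i in evict if sensitivity[i] >= sens_threshold}
    kept = [i for i in range(len(keys)) if i not in evict or i in protected]
    if not kept:
        return [], [], []

    # Running sums and counts, so each kept slot ends up as the mean of its group.
    key_sum = {i: list(keys[i]) for i in kept}
    val_sum = {i: list(values[i]) for i in kept}
    count = {i: 1 for i in kept}

    for i in sorted(evict - protected):
        # Match against the original keys of the kept tokens.
        best = max(kept, key=lambda j: cosine(keys[i], keys[j]))
        if cosine(keys[i], keys[best]) < sim_threshold:
            continue  # assumption: too dissimilar to merge, so it is simply evicted
        key_sum[best] = [a + b for a, b in zip(key_sum[best], keys[i])]
        val_sum[best] = [a + b for a, b in zip(val_sum[best], values[i])]
        count[best] += 1

    new_keys = [[x / count[i] for x in key_sum[i]] for i in kept]
    new_values = [[x / count[i] for x in val_sum[i]] for i in kept]
    return new_keys, new_values, kept


# Toy cache: token 1 resembles token 0; token 3 is marked high-sensitivity.
keys = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
values = [[1.0], [3.0], [10.0], [20.0]]
new_keys, new_values, kept = merge_kv(
    keys, values, evict={1, 3}, sensitivity=[0.0, 0.1, 0.0, 0.9]
)
print(kept, new_values)
```

In this toy run, token 1 is averaged into token 0 and token 3 survives untouched despite being an eviction candidate, so the cache shrinks from four entries to three. A real system would operate per layer and per attention head on tensors, and would derive the sensitivity scores from the model rather than take them as input.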
Related papers
- Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics [22.98826013817833]
We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing. We find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy. We identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival.
arXiv Detail & Related papers (2026-03-02T04:16:36Z)
- Hierarchical Adaptive Eviction for KV Cache Management in Multimodal Language Models [8.944739362562494]
Existing KV cache eviction strategies fail to address the heterogeneous attention distributions between visual and text tokens. We propose Hierarchical Adaptive Eviction (HAE), a KV cache eviction framework that optimizes text-visual token interaction in MLLMs. HAE minimizes KV cache usage across layers, reduces computational overhead via index broadcasting, and theoretically ensures superior information integrity and lower error bounds.
arXiv Detail & Related papers (2026-02-02T15:01:44Z)
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
- SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small model assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z)
- MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference [13.069489189643441]
MadaKV is a modality-adaptive key-value cache eviction strategy for long-context inference. It achieves substantial reductions in KV cache memory footprint and model inference decoding latency. Experiments on representative MLLMs and the MileBench benchmark demonstrate the effectiveness of MadaKV.
arXiv Detail & Related papers (2025-06-06T01:51:24Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
- FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management [48.904743679691414]
FlowKV is a novel multi-turn isolation mechanism for KV Cache management. It preserves the accumulated compressed KV cache from past turns. It prevents the re-compression of older context, thereby mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-05-21T10:20:46Z)
- MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference [15.895020720304656]
MEDA is a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed.
arXiv Detail & Related papers (2025-02-24T19:34:52Z)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias [17.107961913114778]
Multimodal misinformation is a growing problem on social media platforms.
In this study, we investigate and identify the presence of unimodal bias in widely-used MMD benchmarks.
We introduce a new method -- termed Crossmodal HArd Synthetic MisAlignment (CHASMA) -- for generating realistic synthetic training data.
arXiv Detail & Related papers (2023-04-27T12:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.