LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
- URL: http://arxiv.org/abs/2406.18139v1
- Date: Wed, 26 Jun 2024 07:44:24 GMT
- Title: LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
- Authors: Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, Li Yuan,
- Abstract summary: LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size.
It achieves up to 1.5x faster decoding and also maintains or even enhances performance across a variety of long context multimodal tasks.
- Score: 32.20654044142376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
Related papers
- VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration [7.463830743649754]
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks.
Key-Value (KV) cache encodes long visual contexts, such as images or videos.
Existing KV cache compression methods are effective for Large Language Models (LLMs)
We propose a novel KV cache compression recipe tailored for accelerating VLM inference.
arXiv Detail & Related papers (2024-10-29T20:04:34Z) - Inference-Friendly Models With MixAttention [7.103010772135246]
MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers.
Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks.
arXiv Detail & Related papers (2024-09-23T13:37:25Z) - LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture [18.459825048813336]
LongLLaVA is the first hybrid MLLM, which achieved a better balance between efficiency and effectiveness.
It could process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
arXiv Detail & Related papers (2024-09-04T17:25:21Z) - MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model [49.931663904599205]
MaVEn is an innovative framework designed to enhance the capabilities of Multimodal Large Language Models (MLLMs) in multi-image reasoning.
We show that MaVEn significantly enhances MLLMs' understanding in complex multi-image scenarios, while also improving performance in single-image contexts.
arXiv Detail & Related papers (2024-08-22T11:57:16Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.