MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
- URL: http://arxiv.org/abs/2502.01960v1
- Date: Tue, 04 Feb 2025 03:13:09 GMT
- Title: MPIC: Position-Independent Multimodal Context Caching System for Efficient MLLM Serving
- Authors: Shiju Zhao, Junhao Hu, Rongxiao Huang, Jiaqi Zheng, Guihai Chen
- Abstract summary: This paper proposes position-independent caching as a more effective approach for multimodal information management. We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges.
- Score: 32.56855948056532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Context caching is currently employed by prevailing serving platforms to accelerate Multimodal Large Language Model (MLLM) inference. However, this approach merely reuses the Key-Value (KV) cache of the initial prompt sequence, so the full KV cache must be recomputed even when the prefix differs only slightly. This is particularly inefficient for interleaved text and images, as well as for multimodal retrieval-augmented generation. This paper proposes position-independent caching as a more effective approach for multimodal information management. We have designed and implemented a caching system, named MPIC, to address both system-level and algorithm-level challenges. MPIC stores the KV cache on local or remote disks when receiving multimodal data, and calculates and loads the KV cache in parallel during inference. To mitigate accuracy degradation, we have incorporated integrated reuse and recompute mechanisms within the system. Experimental results demonstrate that MPIC achieves up to a 54% reduction in response time compared to existing context caching systems, while maintaining negligible or no accuracy loss.
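The abstract outlines the serving flow at the heart of the system: KV entries are keyed by multimodal content rather than by prompt position, persisted to local or remote disk, and then either loaded or recomputed in parallel at inference time. The sketch below illustrates that flow in Python under simplifying assumptions; the class name, content-hash keying, pickle storage format, and thread-pool scheduling are illustrative choices rather than the paper's implementation, and MPIC's selective recompute of tokens to repair cross-chunk attention is omitted.

```python
# Minimal sketch of position-independent multimodal KV caching in the spirit of
# MPIC. All names, the hashing scheme, and the storage format are illustrative
# assumptions, not the paper's actual implementation.
import hashlib
import pickle
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class PositionIndependentKVCache:
    """Caches per-chunk KV tensors keyed by content hash, not prompt position."""

    def __init__(self, cache_dir: str, compute_kv_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # compute_kv_fn(chunk_bytes) -> KV tensors, e.g. vision encoder + prefill.
        self.compute_kv_fn = compute_kv_fn

    @staticmethod
    def chunk_key(chunk: bytes) -> str:
        # A content hash makes the cache position-independent: the same image or
        # text block maps to the same entry wherever it appears in the prompt.
        return hashlib.sha256(chunk).hexdigest()

    def _path(self, key: str) -> Path:
        return self.cache_dir / f"{key}.kv"

    def _load(self, key: str):
        return pickle.loads(self._path(key).read_bytes())

    def _store(self, key: str, kv) -> None:
        self._path(key).write_bytes(pickle.dumps(kv))

    def get_kv_for_prompt(self, chunks: list[bytes]):
        """Load cached KV from disk and compute missing KV in parallel."""
        keys = [self.chunk_key(c) for c in chunks]
        with ThreadPoolExecutor() as pool:
            futures = [
                pool.submit(self._load, key)                 # I/O-bound: cache hit on disk
                if self._path(key).exists()
                else pool.submit(self.compute_kv_fn, chunk)  # compute-bound: cache miss
                for key, chunk in zip(keys, chunks)
            ]
            kv_per_chunk = [f.result() for f in futures]
        # Persist newly computed entries so later requests can reuse them.
        for key, kv in zip(keys, kv_per_chunk):
            if not self._path(key).exists():
                self._store(key, kv)
        return kv_per_chunk
```

In a real server the I/O path would target local NVMe or remote storage, and the loaded and freshly computed entries would be stitched together before decoding; the paper's integrated reuse-and-recompute mechanism then restores accuracy where cross-chunk attention matters.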
Related papers
- KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference [7.894452711850396]
KVShare is a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity.
It is designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-03-17T16:43:35Z)
- Adaptive Semantic Prompt Caching with VectorQ [78.59891542553179]
Vector similarity metrics assign a numerical score to quantify the similarity between an embedded prompt and its nearest neighbor in the cache.
Existing systems rely on a static threshold to classify whether the similarity score is sufficiently high to result in a cache hit.
We show that this one-size-fits-all threshold is insufficient across different embeddings.
We propose VectorQ, an online framework with a threshold convergence guarantee to learn embedding-specific threshold regions.
arXiv Detail & Related papers (2025-02-06T04:16:20Z)
- MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference [5.1206021159434805]
MPCache is built on the observation that historical tokens in a long sequence may have different effects on the downstream decoding. MPCache consistently outperforms prior-art KV cache eviction baselines across different LLM generation tasks (a toy sketch of this kind of attention-score-based eviction appears after this list).
arXiv Detail & Related papers (2025-01-12T13:18:04Z)
- Efficiently serving large multimedia models using EPD Disaggregation [21.41566892925781]
We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Our findings underscore the potential of EPD disaggregation to enable resource-efficient inference at multimodal scale.
arXiv Detail & Related papers (2024-12-25T10:11:31Z)
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [19.510078997414606]
EPIC introduces position-independent context caching for large language models.
EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
- Compute Or Load KV Cache? Why Not Both? [6.982874528357836]
Cake is a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel.
Cake achieves on average 2.6x reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods.
arXiv Detail & Related papers (2024-10-04T01:11:09Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference [32.20654044142376]
LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size.
It achieves up to 1.5x faster decoding and also maintains or even enhances performance across a variety of long context multimodal tasks.
arXiv Detail & Related papers (2024-06-26T07:44:24Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
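Several of the entries above (MPCache, ThinK, Elastic Cache, CORM, LOOK-M) revolve around shrinking the KV cache by retaining only the tokens that matter for decoding. The toy sketch below shows the common pattern of attention-score-based eviction; the averaging heuristic, fixed budget, and function name are simplifications and do not reproduce any single paper's algorithm.

```python
# Toy illustration of attention-score-based KV cache eviction; the scoring
# heuristic and budget policy are simplifications, not any paper's exact method.
import numpy as np

def evict_kv(keys: np.ndarray, values: np.ndarray,
             attn_weights: np.ndarray, budget: int):
    """Keep only the `budget` cached tokens that recent queries attended to most.

    keys, values:  [seq_len, head_dim] cached KV for one attention head
    attn_weights:  [num_recent_queries, seq_len] softmax attention from recent steps
    """
    importance = attn_weights.mean(axis=0)            # average attention per cached token
    keep = np.sort(np.argsort(importance)[-budget:])  # top-`budget` tokens, original order
    return keys[keep], values[keep], keep

# Example: 8 cached tokens, 2 recent query steps, keep the 4 most-attended tokens.
rng = np.random.default_rng(0)
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 4))
attn = rng.random((2, 8))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows, like softmax output
k_small, v_small, kept = evict_kv(k, v, attn, budget=4)
print(kept, k_small.shape)               # retained token indices, (4, 4)
```

Real systems layer considerably more on top of this pattern, such as query-driven channel pruning (ThinK), importance-driven cache merging rather than pure dropping (Elastic Cache), and MPC-friendly approximations (MPCache).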
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.