Related papers: KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference

URL: http://arxiv.org/abs/2503.16525v1
Date: Mon, 17 Mar 2025 16:43:35 GMT
Title: KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference
Authors: Huan Yang, Renji Zhang, Deyu Zhang,
Abstract summary: KVShare is a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity. It is designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
Score: 7.894452711850396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper presents KVShare, a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity, designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Addressing the limitations of existing prefix caching (strict text prefix matching) and semantic caching (loss of response diversity), KVShare achieves fine-grained KV cache reuse through semantic alignment algorithms and differential editing operations. Experiments on real-world user conversation datasets demonstrate that KVShare improves KV cache hit rates by over 60%, while maintaining output quality comparable to full computation (no significant degradation in BLEU and Rouge-L metrics). This approach effectively reduces GPU resource consumption and is applicable to scenarios with repetitive queries, such as healthcare and education.
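The contrast the abstract draws between strict prefix matching and edit-based reuse can be illustrated with a small sketch. This is not the KVShare algorithm: it aligns tokens by exact equality using Python's standard difflib rather than by semantic similarity, and the function names prefix_reuse and diff_reuse are hypothetical. It only shows why an edit script can mark more positions as reuse candidates than a prefix match. In a real causal model, cached entries after a changed token are no longer exact, so any such reuse is an approximation.

```python
from difflib import SequenceMatcher


def prefix_reuse(cached, new):
    """Count leading tokens of `new` that match `cached` (strict prefix matching)."""
    count = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        count += 1
    return count


def diff_reuse(cached, new):
    """Split positions of `new` into reuse candidates and positions to recompute.

    Hypothetical illustration: positions inside an "equal" block of the edit
    script are reuse candidates; replaced or inserted positions are recomputed.
    """
    matcher = SequenceMatcher(a=cached, b=new, autojunk=False)
    reusable, recompute = [], []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            reusable.extend(range(j1, j2))
        elif tag in ("replace", "insert"):
            recompute.extend(range(j1, j2))
        # "delete" covers tokens only present in `cached`; nothing to do for `new`.
    return reusable, recompute


cached_prompt = "what are the symptoms of flu in adults".split()
new_prompt = "what are the symptoms of measles in adults".split()

prefix_hits = prefix_reuse(cached_prompt, new_prompt)
reusable, recompute = diff_reuse(cached_prompt, new_prompt)
print(prefix_hits, reusable, recompute)
```

In this toy pair, prefix matching stops at the first differing word, while the edit script also marks the two trailing tokens as reuse candidates and isolates the single changed position.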

Related papers

DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point pruning halts. Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse [35.97391418064724]
KVLink is an approach for efficient key-value (KV) cache reuse in large language models (LLMs). KVLink introduces three key components: adjusting positional embeddings of KV cache at inference to match the global position after concatenation, using trainable special tokens to restore self-attention, and applying mixed-data fine-tuning. Experiments across 7 datasets demonstrate that KVLink improves question answering accuracy by an average of 4% over state-of-the-art methods.
arXiv Detail & Related papers (2025-02-21T23:34:29Z)
SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z)
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
A Method for Building Large Language Models with Predefined KV Cache Capacity [11.710667043543545]
The Bounded-Cache Transformer (BCT) addresses the excessive memory consumption issue in traditional KV caches. By dynamically updating the key-value vector sequences, the BCT achieves efficient inference within limited cache capacity. Experimental results demonstrate that the BCT significantly reduces memory usage while maintaining the model's inference quality.
arXiv Detail & Related papers (2024-11-24T11:30:00Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension. However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts. KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
arXiv Detail & Related papers (2024-07-25T12:56:22Z)
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference [32.20654044142376]
LOOK-M is a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size. It achieves up to 1.5x faster decoding and also maintains or even enhances performance across a variety of long context multimodal tasks.
arXiv Detail & Related papers (2024-06-26T07:44:24Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length. We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.