A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
- URL: http://arxiv.org/abs/2410.14442v1
- Date: Fri, 18 Oct 2024 13:01:14 GMT
- Title: A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference
- Authors: You Wu, Haoyi Wu, Kewei Tu
- Abstract summary: Sharing the key-value (KV) cache across layers has been found effective for efficient inference of large language models (LLMs).
We propose a unified framework that covers several recent methods and their novel variants.
We find that when reducing the size of the KV cache by 2x, most configurations can achieve competitive performance to and higher throughput than standard transformers.
- Score: 41.149350870029046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, sharing key-value (KV) cache across layers has been found effective in efficient inference of large language models (LLMs). To systematically investigate different techniques of cross-layer KV sharing, we propose a unified framework that covers several recent methods and their novel variants. We conduct comprehensive experiments on all the configurations of the framework, evaluating their generation throughput and performance in language modeling and downstream tasks. We find that when reducing the size of the KV cache by 2x, most configurations can achieve competitive performance to and higher throughput than standard transformers, but when further reducing the size of the KV cache, pairing queries of all layers with KVs of upper layers can better maintain performance, although it also introduces additional training cost and prefilling latency. We hope that this work will help users choose the appropriate approach according to their requirements and facilitate research on the acceleration of LLM inference.
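To make the "2x smaller KV cache" claim concrete, the following is a minimal sketch of how cache size scales with the number of layers that actually store keys and values. It is not the authors' implementation: the function names `kv_cache_bytes` and `share_map`, the `use_top` flag, and the model dimensions in the usage lines are all hypothetical, and the grouping of consecutive layers is just one simple sharing pattern chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, num_kv_layers=None):
    """Estimate KV cache size in bytes.

    num_kv_layers is the number of layers that store their own keys and
    values; the remaining layers are assumed to reuse another layer's cache.
    """
    if num_kv_layers is None:
        num_kv_layers = num_layers  # standard transformer: every layer caches KV
    if not 1 <= num_kv_layers <= num_layers:
        raise ValueError("num_kv_layers must be between 1 and num_layers")
    # Factor 2 accounts for storing both keys and values.
    return (2 * num_kv_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def share_map(num_layers, group_size, use_top=False):
    """Map each layer to the layer whose KV it reads (hypothetical scheme).

    Consecutive layers are split into groups of group_size. Each group reads
    the KV of its bottom layer, or of its top layer when use_top is True.
    """
    mapping = []
    for layer in range(num_layers):
        start = (layer // group_size) * group_size
        end = min(start + group_size, num_layers) - 1
        mapping.append(end if use_top else start)
    return mapping


# Hypothetical 32-layer model, fp16 cache, 4096-token context.
full = kv_cache_bytes(32, 32, 128, 4096)
mapping = share_map(32, group_size=2)
shared = kv_cache_bytes(32, 32, 128, 4096, num_kv_layers=len(set(mapping)))
print(full / shared)  # ratio of full to shared cache size
```

Under these assumed dimensions, pairing layers two at a time leaves 16 layers that store KV, which halves the cache. Setting `use_top=True` makes lower layers read the KV of a layer above them, which loosely mirrors the abstract's "KVs of upper layers" case; the abstract notes that such configurations add training cost and prefilling latency, which this size estimate does not capture.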
Related papers
- xKV: Cross-Layer SVD for KV-Cache Compression [8.250015628919098]
Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption.
Recent studies attempted to merge KV-cache from multiple layers into shared representations.
We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache.
xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes.
arXiv Detail & Related papers (2025-03-24T17:06:37Z)
- WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference [9.572076809796448]
We propose a novel task-adaptive KV cache window selection method, WindowKV.
We show that WindowKV maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache.
Our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.
arXiv Detail & Related papers (2025-03-23T03:36:52Z)
- KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference [7.894452711850396]
KVShare is a multi-user Key-Value (KV) cache sharing technology based on semantic similarity.
It is designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-03-17T16:43:35Z)
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.
It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process halts.
Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference.
Experiments on the Long-Bench and $\infty$ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves state-of-the-art performance compared with other methods.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- Inference-Friendly Models With MixAttention [7.103010772135246]
MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers.
Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks.
arXiv Detail & Related papers (2024-09-23T13:37:25Z)
- CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios [13.144156413032896]
We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression.
We show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability.
Our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
arXiv Detail & Related papers (2024-09-16T17:36:50Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension.
However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
arXiv Detail & Related papers (2024-07-25T12:56:22Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention is scattered widely in lower layers.
Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
- Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.