No Token Left Behind: Reliable KV Cache Compression via Importance-Aware
Mixed Precision Quantization
- URL: http://arxiv.org/abs/2402.18096v1
- Date: Wed, 28 Feb 2024 06:34:54 GMT
- Title: No Token Left Behind: Reliable KV Cache Compression via Importance-Aware
Mixed Precision Quantization
- Authors: June Yong Yang, Byeongwook Kim, Jeongin Bae, Beomseok Kwon, Gunho
Park, Eunho Yang, Se Jung Kwon, Dongsoo Lee
- Abstract summary: Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models(LLMs)
- Score: 31.806112535762367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key-Value (KV) Caching has become an essential technique for accelerating the
inference speed and throughput of generative Large Language Models~(LLMs).
However, the memory footprint of the KV cache poses a critical bottleneck in
LLM deployment as the cache size grows with batch size and sequence length,
often surpassing even the size of the model itself. Although recent methods
were proposed to select and evict unimportant KV pairs from the cache to reduce
memory consumption, the potential ramifications of eviction on the generative
process are yet to be thoroughly examined. In this paper, we examine the
detrimental impact of cache eviction and observe that unforeseen risks arise as
the information contained in the KV pairs is exhaustively discarded, resulting
in safety breaches, hallucinations, and context loss. Surprisingly, we find
that preserving even a small amount of information contained in the evicted KV
pairs via reduced precision quantization substantially recovers the incurred
degradation. On the other hand, we observe that the important KV pairs must be
kept at a relatively higher precision to safeguard the generation quality.
Motivated by these observations, we propose \textit{Mixed-precision KV
cache}~(MiKV), a reliable cache compression method that simultaneously
preserves the context details by retaining the evicted KV pairs in
low-precision and ensure generation quality by keeping the important KV pairs
in high-precision. Experiments on diverse benchmarks and LLM backbones show
that our proposed method offers a state-of-the-art trade-off between
compression ratio and performance, compared to other baselines.
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios [13.144156413032896]
We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression.
We show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability.
Our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
arXiv Detail & Related papers (2024-09-16T17:36:50Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression [13.981807478365452]
Existing approaches to reduce the Key-Value cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length.
We find a clear correlation between the $L$ and the attention scores over cached KV pairs, where a low $L$ of a key embedding leads to a high attention score during decoding.
Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
arXiv Detail & Related papers (2024-06-17T11:35:16Z) - Effectively Compress KV Heads for LLM [28.0801697946958]
We propose a novel approach for compressing Key-Value ( KV) caches.
Our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs.
arXiv Detail & Related papers (2024-06-11T08:37:33Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers.
Motivated by these insights, we developed Pyramid KV, a novel and effective KV cache compression method.
arXiv Detail & Related papers (2024-06-04T07:51:30Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [82.08922896531618]
We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs)
We conduct targeted profiling to discern the intrinsic structure of attention modules.
Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.
arXiv Detail & Related papers (2023-10-03T05:17:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.