PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
- URL: http://arxiv.org/abs/2405.12532v2
- Date: Wed, 5 Jun 2024 09:01:24 GMT
- Title: PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
- Authors: Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao
- Abstract summary: Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context.
PyramidInfer achieves 2.2x higher throughput than Accelerate while reducing the GPU memory used by the KV cache by over 54%.
- Score: 57.53291046180288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference, hindering their scalability for real-time applications like chatbots. To accelerate inference, we store computed keys and values (KV cache) in the GPU memory. Existing methods study the KV cache compression to reduce memory by pruning the pre-computed KV cache. However, they neglect the inter-layer dependency between layers and huge memory consumption in pre-computation. To explore these deficiencies, we find that the number of crucial keys and values that influence future generations decreases layer by layer and we can extract them by the consistency in attention weights. Based on the findings, we propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context. PyramidInfer saves significant memory by computing fewer keys and values without sacrificing performance. Experimental results show PyramidInfer improves 2.2x throughput compared to Accelerate with over 54% GPU memory reduction in KV cache.
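The abstract's two ideas, that fewer keys and values need to be kept in deeper layers and that the crucial ones can be picked out from attention weights, can be illustrated with a minimal sketch. This is not the authors' algorithm: the linear retention schedule, the mean-attention scoring rule, and the names `pyramid_schedule`, `select_crucial`, and `kv_cache_bytes` are all assumptions made for illustration, and the model dimensions are only an example configuration.

```python
from typing import List


def kv_cache_bytes(tokens_per_layer: List[int], num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size in bytes, given how many tokens each layer retains."""
    # Factor 2 accounts for storing both a key and a value per token.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value
    return batch_size * per_token * sum(tokens_per_layer)


def pyramid_schedule(seq_len: int, num_layers: int,
                     first_ratio: float, last_ratio: float) -> List[int]:
    """Hypothetical schedule: retained tokens shrink linearly with depth."""
    if num_layers == 1:
        return [max(1, round(seq_len * first_ratio))]
    step = (last_ratio - first_ratio) / (num_layers - 1)
    return [max(1, round(seq_len * (first_ratio + i * step)))
            for i in range(num_layers)]


def select_crucial(attn_rows: List[List[float]], keep: int) -> List[int]:
    """Indices of the `keep` context positions with the highest mean
    attention weight across the given (e.g. most recent) query rows.
    The scoring rule is an assumption, not the paper's exact criterion."""
    n = len(attn_rows[0])
    mean = [sum(row[j] for row in attn_rows) / len(attn_rows)
            for j in range(n)]
    top = sorted(range(n), key=lambda j: mean[j], reverse=True)[:keep]
    return sorted(top)  # keep the original token order


# Example configuration (assumed): 32 layers, 32 KV heads, head_dim 128, fp16.
seq_len, num_layers = 2048, 32
full = kv_cache_bytes([seq_len] * num_layers, num_kv_heads=32, head_dim=128)
kept = pyramid_schedule(seq_len, num_layers, first_ratio=1.0, last_ratio=0.2)
compressed = kv_cache_bytes(kept, num_kv_heads=32, head_dim=128)
print(f"full cache: {full / 2**20:.0f} MiB")
print(f"pyramid cache: {compressed / 2**20:.0f} MiB")
```

The 1.0 to 0.2 retention range is arbitrary and is not tuned to reproduce the 54% reduction reported in the abstract; it only shows how a per-layer schedule translates into cache size.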
Related papers
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
PyramidKV is a novel and effective KV cache compression method.
We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache.
In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value (KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM [39.77567916589569]
Key-value (KV) caching has become the de facto approach for accelerating generation in large language model (LLM) inference.
Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly.
We propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression.
arXiv Detail & Related papers (2024-03-08T18:48:30Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
- LoMA: Lossless Compressed Memory Attention [0.0]
Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation.
LoMA incorporates a specialized training or fine-tuning procedure alongside an autoregressive generation algorithm optimized for the compressed context.
Experimental validation has demonstrated that LoMA significantly reduces computational consumption and memory usage.
arXiv Detail & Related papers (2024-01-16T09:18:46Z)