Efficient LLM Inference with Kcache
- URL: http://arxiv.org/abs/2404.18057v1
- Date: Sun, 28 Apr 2024 03:11:42 GMT
- Title: Efficient LLM Inference with Kcache
- Authors: Qiaozhi He, Zhihua Wu,
- Abstract summary: Large Language Models (LLMs) have had a profound impact on AI applications.
KV Cache technology is one of the most widely used techniques in the industry.
We propose a novel KCache technique to alleviate the memory bottleneck issue during the LLMs inference process.
- Score: 3.945956673130761
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models(LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures efficient sequence generation by caching previously computed KV states. However, it also introduces significant memory overhead. We discovered that KV Cache is not necessary and proposed a novel KCache technique to alleviate the memory bottleneck issue during the LLMs inference process. KCache can be used directly for inference without any training process, Our evaluations show that KCache improves the throughput of popular LLMs by 40% with the baseline, while keeping accuracy.
Related papers
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - PQCache: Product Quantization-based KVCache for Long Context LLM Inference [27.523568511043273]
Key-Value Cache (KVCache) is a crucial component in Large Language Models (LLMs)
Current methods selectively determine suitable keys and values for self-attention in LLMs to address the issue.
We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency.
arXiv Detail & Related papers (2024-07-01T13:05:42Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
Pyramid KV is a novel and effective KV cache compression method.
We show that Pyramid KV matches the performance of models with a full KV cache while retaining only 12% of the KV cache.
In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, Pyramid KV surpasses other KV cache compression techniques achieving up to a 20.5 absolute accuracy improvement on TREC.
arXiv Detail & Related papers (2024-06-04T07:51:30Z) - MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value ( KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
arXiv Detail & Related papers (2024-05-23T09:43:52Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.