CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- URL: http://arxiv.org/abs/2405.16444v2
- Date: Mon, 3 Jun 2024 10:57:57 GMT
- Title: CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion
- Authors: Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang,
- Abstract summary: Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts.
To speed up the prefill, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input.
We present CacheBlend, a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache.
- Score: 15.344568214955688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. However, the reused text chunks are not always the input prefix, and when they are not, their precomputed KV caches cannot be directly used since they ignore the text's cross-attention with the preceding text in the LLM input. Thus, the benefits of reusing KV caches remain largely unrealized. This paper tackles just one question: when an LLM input contains multiple text chunks, how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill (i.e., without reusing KV cache)? We present CacheBlend, a scheme that reuses the pre-computed KV caches, regardless prefix or not, and selectively recomputes the KV values of a small subset of tokens to partially update each reused KV cache. In the meantime,the small extra delay for recomputing some tokens can be pipelined with the retrieval of KV caches within the same job,allowing CacheBlend to store KV caches in slower devices with more storage capacity while retrieving them without increasing the inference delay. By comparing CacheBlend with the state-of-the-art KV cache reusing schemes on three open-source LLMs of various sizes and four popular benchmark datasets of different tasks, we show that CacheBlend reduces time-to-first-token (TTFT) by 2.2-3.3X and increases the inference throughput by 2.8-5X, compared with full KV recompute, without compromising generation quality or incurring more storage cost.
Related papers
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache [49.608367376911694]
We propose a novel mechanism for storing longer sliding window contexts with the same total cache size.
We show improvements of 5.6% on long context generation (LongBench), 1.2% in streaming perplexity (PG19), and 0.6% in language understanding (MMLU STEM) using LLMs given the same fixed cache size.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
Pyramid KV is a novel and effective KV cache compression method.
We show that Pyramid KV matches the performance of models with a full KV cache while retaining only 12% of the KV cache.
In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, Pyramid KV surpasses other KV cache compression techniques achieving up to a 20.5 absolute accuracy improvement on TREC.
arXiv Detail & Related papers (2024-06-04T07:51:30Z) - SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models [43.22490117833939]
SKVQ stands for sliding-window KV cache quantization.
S KVQ rearranges the channels of the KV cache in order to improve the similarity of channels in quantization groups.
It is possible to process context lengths of up to 1M on an 80GB memory GPU for a 7b model and up to 7 times faster decoding.
arXiv Detail & Related papers (2024-05-10T03:06:24Z) - Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [13.041210267981613]
CachedAttention is a new attention mechanism that enables reuse of KV caches across multi-turn conversations.
It significantly decreases the time to the first token (TTFT) by up to 87%, improves the prompt prefilling throughput by up to 7.8$times$ for multi-turn conversations, and reduces the end-to-end inference cost by up to 70%.
arXiv Detail & Related papers (2024-03-23T10:42:49Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z) - CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving [31.766738294505767]
CacheGen is a fast context-loading module for large language models.
Uses a custom tensor encoder to encode a KV cache into compact bitstream representations.
adapts the compression level of different parts of a KV cache to cope with changes in available bandwidth.
arXiv Detail & Related papers (2023-10-11T07:08:20Z) - Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs [86.98304577162465]
We introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs)
We conduct targeted profiling to discern the intrinsic structure of attention modules.
Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens.
arXiv Detail & Related papers (2023-10-03T05:17:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.