InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- URL: http://arxiv.org/abs/2406.19707v1
- Date: Fri, 28 Jun 2024 07:41:26 GMT
- Title: InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
- Authors: Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim,
- Abstract summary: Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks.
serving inference for generating long contents poses a challenge due to the enormous memory footprint of the transient state.
InfiniGen is a novel KV cache management framework tailored for long-text generation.
- Score: 0.5899781520375794
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.
Related papers
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs)
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers.
Our method achieves up to 26$times$ higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $ell$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.