XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
- URL: http://arxiv.org/abs/2412.05896v1
- Date: Sun, 08 Dec 2024 11:32:08 GMT
- Title: XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
- Authors: Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu
- Abstract summary: Large Language Model (LLM) inference generates output tokens one-by-one, leading to many redundant computations.
KV-Cache framework makes a compromise between time and space complexities.
Existing studies reduce memory consumption by evicting cached data that have little impact on inference accuracy.
We show that customizing the cache size for each layer in a personalized manner can yield a significant memory reduction.
- Score: 9.65524177141491
- License:
- Abstract: Recently, generative Large Language Models (LLMs) have achieved remarkable success in numerous applications. Notably, their inference generates output tokens one by one, leading to many redundant computations. The widely-used KV-Cache framework makes a compromise between time and space complexities. However, the cached data create a steadily growing memory demand that can quickly exhaust the limited memory capacity of modern accelerators such as GPUs, particularly in long-context inference tasks. Existing studies reduce memory consumption by evicting cached data that have little impact on inference accuracy, but the practical benefit is far from ideal because the cache is allocated statically across the LLM's network layers. This paper observes that layer-specific cached data have very different impacts on accuracy. We quantify this difference and give experimental and theoretical validation. We accordingly present a formal analysis showing that customizing the cache size of each layer in a personalized manner can yield a significant memory reduction while still providing comparable accuracy. We formulate the cache allocation as a combinatorial optimization problem and give a globally optimal solution. In particular, we devise a mini- and sampling-based inference over a lightweight variant of the LLM, so as to quickly capture the layer-wise differences and feed them into the personalized allocation algorithm. Extensive experiments on real-world datasets demonstrate that our proposals reduce KV cache memory consumption by 61.6% on average, improve computational efficiency by 2.1x, and increase throughput by up to 5.5x.
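The per-layer allocation idea can be pictured with a minimal sketch. The code below is not XKV's actual optimization; it simply splits a global KV-token budget across layers in proportion to illustrative per-layer importance scores (the function name, the proportional rule, and the toy scores are all assumptions):

```python
import numpy as np

def personalize_kv_budgets(importance, total_budget, min_tokens=64):
    """Split a global KV-cache token budget across layers in proportion
    to each layer's estimated impact on accuracy.

    importance   : one non-negative score per layer (e.g. measured on a
                   small sampled prefill, as a stand-in for XKV's profiling).
    total_budget : total number of cached tokens allowed over all layers.
    min_tokens   : floor so that no layer is starved completely.
    """
    importance = np.asarray(importance, dtype=float)
    n_layers = importance.size
    # Reserve the per-layer floor first, then distribute the remainder
    # proportionally to importance (a crude stand-in for the paper's
    # combinatorial optimization).
    remaining = total_budget - n_layers * min_tokens
    if remaining < 0:
        raise ValueError("total_budget too small for the per-layer floor")
    weights = importance / importance.sum()
    budgets = min_tokens + np.floor(weights * remaining).astype(int)
    # Hand any rounding leftovers to the most important layers.
    leftover = total_budget - budgets.sum()
    for i in np.argsort(-importance)[:leftover]:
        budgets[i] += 1
    return budgets

# Example: 8 layers, shallow layers matter more in this toy profile.
scores = [9.0, 7.5, 5.0, 3.0, 2.0, 1.5, 1.0, 1.0]
print(personalize_kv_budgets(scores, total_budget=4096))
```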
Related papers
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR).
CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference.
Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
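As a rough illustration only (not CSR's actual construction), a dense KV tensor can be replaced by per-vector indexes and weights; the top-k rule and all names below are assumptions:

```python
import numpy as np

def to_sparse_topk(kv, k):
    """Keep the k largest-magnitude entries of each cached vector,
    storing only their indexes and weights (a crude stand-in for a
    sparse KV representation)."""
    idx = np.argsort(-np.abs(kv), axis=-1)[..., :k]      # indexes to keep
    weights = np.take_along_axis(kv, idx, axis=-1)       # their values
    return idx.astype(np.int16), weights.astype(np.float16)

def from_sparse_topk(idx, weights, dim):
    """Reconstruct an approximate dense KV tensor from the sparse form."""
    dense = np.zeros(idx.shape[:-1] + (dim,), dtype=np.float32)
    np.put_along_axis(dense, idx.astype(np.int64),
                      weights.astype(np.float32), axis=-1)
    return dense

keys = np.random.randn(2, 128, 64).astype(np.float32)   # (heads, tokens, dim)
idx, w = to_sparse_topk(keys, k=8)
approx = from_sparse_topk(idx, w, dim=64)
print("kept", idx.shape[-1], "of", keys.shape[-1], "entries per vector")
```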
arXiv Detail & Related papers (2024-12-16T13:01:53Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves the state-of-the-art performance compared with others.
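A hedged sketch of the "single global knob" idea: one retention level induces a prefix cache size per layer, and that level is searched to meet a token budget. The retention rule, the binary search, and the toy scores are assumptions, not PrefixKV's actual procedure:

```python
import numpy as np

def layer_prefix_sizes(layer_scores, retention):
    """For each layer, keep the smallest prefix of importance-sorted
    tokens whose cumulative score reaches `retention` of the total."""
    sizes = []
    for scores in layer_scores:
        order = np.sort(scores)[::-1]
        cum = np.cumsum(order) / order.sum()
        sizes.append(min(scores.size, int(np.searchsorted(cum, retention)) + 1))
    return np.array(sizes)

def search_global_retention(layer_scores, token_budget, tol=1e-3):
    """Binary-search one global retention level so that the induced
    per-layer cache sizes fit a total token budget."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if layer_prefix_sizes(layer_scores, mid).sum() <= token_budget:
            lo = mid          # fits: try to retain more
        else:
            hi = mid          # too large: retain less
    return lo, layer_prefix_sizes(layer_scores, lo)

# Toy per-layer token-importance scores: 4 layers, 512 tokens each.
rng = np.random.default_rng(0)
scores = [rng.random(512) ** (i + 1) for i in range(4)]   # layers differ
level, sizes = search_global_retention(scores, token_budget=800)
print(level, sizes, sizes.sum())
```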
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - InstCache: A Predictive Cache for LLM Serving [9.878166964839512]
We propose to predict user instructions with an instruction-aligned LLM and store them in a predictive cache, called InstCache.
Experimental results show that InstCache can achieve up to 51.34% hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost of only 4.5GB.
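A minimal sketch of the serving-side lookup such a predictive cache implies; the class name, the exact-match keying, and the stand-in "LLM" are illustrative assumptions, and the offline population step is only stubbed out:

```python
class InstCacheSketch:
    """Toy predictive cache: answers for likely user instructions are
    pre-generated offline and served with an exact-match lookup."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def populate(self, predicted_instructions, generate):
        # Offline: `predicted_instructions` would come from an
        # instruction-aligned LLM; here it is just an iterable of strings.
        for inst in predicted_instructions:
            self._store[inst.strip().lower()] = generate(inst)

    def serve(self, instruction, generate):
        key = instruction.strip().lower()
        if key in self._store:
            self.hits += 1
            return self._store[key]          # cache hit: no decoding needed
        self.misses += 1
        answer = generate(instruction)       # cache miss: fall back to the LLM
        self._store[key] = answer
        return answer

# Usage with a stand-in "LLM".
fake_llm = lambda prompt: f"answer to: {prompt}"
cache = InstCacheSketch()
cache.populate(["What is KV cache?", "Summarize this paper"], fake_llm)
print(cache.serve("what is kv cache?", fake_llm))
print(cache.hits, cache.misses)
```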
arXiv Detail & Related papers (2024-11-21T03:52:41Z) - VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration [7.463830743649754]
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks.
A large Key-Value (KV) cache is required to encode long visual contexts, such as images or videos.
Existing KV cache compression methods are effective for Large Language Models (LLMs) but are not tailored to VLM inference.
We propose a novel KV cache compression recipe tailored for accelerating VLM inference.
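The summary gives no further detail, so the sketch below only illustrates one possible reading of "modality-aware" allocation: splitting a layer's KV budget between visual and text tokens by how much attention each modality receives. The function and the proportional rule are assumptions, not VL-Cache's recipe:

```python
def split_budget_by_modality(attn_to_visual, attn_to_text, layer_budget):
    """Split one layer's KV budget between visual and text tokens in
    proportion to how much attention each modality actually receives
    (a toy stand-in for modality-aware allocation)."""
    total = attn_to_visual + attn_to_text
    visual_share = attn_to_visual / total if total > 0 else 0.5
    visual_budget = int(round(layer_budget * visual_share))
    return visual_budget, layer_budget - visual_budget

# Example: this layer mostly attends to text, so text keeps more cache.
print(split_budget_by_modality(attn_to_visual=0.2, attn_to_text=0.8,
                               layer_budget=1024))
```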
arXiv Detail & Related papers (2024-10-29T20:04:34Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
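A hedged sketch of query-driven channel pruning of the key cache; the scoring rule (mean |Q| times mean |K| per channel) and all names are illustrative assumptions, not ThinK's exact criterion:

```python
import numpy as np

def prune_key_channels(keys, queries, keep_ratio=0.6):
    """Drop the key-cache channels that contribute least to the
    query-key dot products (a simplified, query-driven criterion)."""
    # Score each channel by the magnitude of its contribution to Q.K^T.
    channel_score = np.abs(queries).mean(axis=0) * np.abs(keys).mean(axis=0)
    k = max(1, int(keys.shape[-1] * keep_ratio))
    kept = np.argsort(-channel_score)[:k]           # channels to keep
    return keys[:, kept], kept

tokens, dim = 256, 128
keys = np.random.randn(tokens, dim).astype(np.float32)
queries = np.random.randn(32, dim).astype(np.float32)   # recent queries
pruned_keys, kept_channels = prune_key_channels(keys, queries)
print(pruned_keys.shape)        # (256, 76): ~40% of key channels removed
```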
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, the importance of each cache entry is evaluated by how frequently it is attended to.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
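A rough illustration of merging rather than evicting: each low-importance entry is folded into the nearest kept entry by averaging. The nearest-position rule and simple averaging are assumptions, not Elastic Cache's exact merge:

```python
import numpy as np

def merge_low_importance(kv, importance, keep_ratio=0.5):
    """Instead of simply evicting low-importance cache entries, merge
    each one into the nearest kept entry by averaging (a sketch of
    importance-driven cache merging)."""
    n = kv.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.sort(np.argsort(-importance)[:k])      # indexes of kept entries
    merged = kv[keep].copy()
    counts = np.ones(k)
    for i in np.setdiff1d(np.arange(n), keep):
        j = np.argmin(np.abs(keep - i))              # nearest kept position
        merged[j] += kv[i]
        counts[j] += 1
    return merged / counts[:, None], keep

kv = np.random.randn(128, 64).astype(np.float32)     # (tokens, head_dim)
importance = np.random.rand(128)                      # e.g. attention frequency
merged_kv, kept_positions = merge_low_importance(kv, importance)
print(merged_kv.shape)                                # (64, 64)
```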
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
Key-Value (KV) cache stores key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
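The depth-wise sharing idea can be sketched as follows; note that MiniCache itself interpolates more carefully and keeps highly distinct tokens unmerged, so the plain averaging and pairing below are assumptions:

```python
import numpy as np

def merge_adjacent_layers(kv_per_layer):
    """Share one KV tensor between each pair of adjacent layers by
    simple averaging, to illustrate compressing the cache along the
    depth dimension."""
    merged = []
    for lo in range(0, len(kv_per_layer) - 1, 2):
        shared = 0.5 * (kv_per_layer[lo] + kv_per_layer[lo + 1])
        merged.append((lo, lo + 1, shared))           # both layers reuse `shared`
    if len(kv_per_layer) % 2:                          # odd layer count: last stays
        merged.append((len(kv_per_layer) - 1, None, kv_per_layer[-1]))
    return merged

layers = [np.random.randn(128, 64).astype(np.float32) for _ in range(8)]
merged = merge_adjacent_layers(layers)
print(len(layers), "layer caches ->", len(merged), "stored tensors")
```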
arXiv Detail & Related papers (2024-05-23T09:43:52Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
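A simplified take on recent-attention-based retention (the threshold, the window size, and the keep-the-newest rule are illustrative assumptions, not CORM's exact policy):

```python
import numpy as np

def corm_style_keep_mask(attn_recent, recent_window=8, threshold=0.01):
    """Decide which cached tokens to retain based on the attention that a
    window of recent queries paid to them.

    attn_recent : (recent_queries, cached_tokens) attention weights.
    """
    keep = attn_recent.max(axis=0) >= threshold       # recently attended tokens
    keep[-recent_window:] = True                       # always keep the newest tokens
    return keep

# Toy attention: a few old tokens are still heavily attended.
rng = np.random.default_rng(1)
attn = rng.dirichlet(np.full(256, 0.05), size=16)      # 16 recent queries, 256 tokens
mask = corm_style_keep_mask(attn)
print(f"retained {mask.sum()} of {mask.size} cached tokens")
```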
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6x less peak memory.
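The asymmetric 2-bit step can be sketched as below; KIVI additionally quantizes keys per-channel and values per-token and keeps a small full-precision residual, so the grouping choice and names here are assumptions:

```python
import numpy as np

def quantize_2bit_asym(x, axis):
    """Asymmetric 2-bit quantization along one axis: store 2-bit codes
    plus a per-group scale and zero point."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / 3.0                             # 2 bits -> 4 levels (0..3)
    scale = np.where(scale == 0, 1.0, scale)            # guard constant groups
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize_2bit_asym(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

keys = np.random.randn(128, 64).astype(np.float32)      # (tokens, channels)
codes, scale, zero = quantize_2bit_asym(keys, axis=0)    # per-channel groups
approx = dequantize_2bit_asym(codes, scale, zero)
print("max abs error:", float(np.abs(keys - approx).max()))
```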
arXiv Detail & Related papers (2024-02-05T06:06:47Z)