Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
- URL: http://arxiv.org/abs/2407.11550v3
- Date: Fri, 16 Aug 2024 08:46:33 GMT
- Title: Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
- Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
- Abstract summary: Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency.
Recent efforts try to reduce KV cache size to a given memory budget by evicting vast numbers of non-critical cache elements during runtime.
- Score: 19.447729423696096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency due to the expanding Key-Value (KV) cache required for long-sequence inference. Recent efforts try to reduce KV cache size to a given memory budget by evicting vast numbers of non-critical cache elements during runtime, while preserving generation quality. Our revisiting of current eviction methods reveals that they fundamentally minimize an upper bound of the $L_1$ eviction loss between the pre- and post-eviction outputs of multi-head self-attention mechanisms. Moreover, our analysis indicates that the common practice of uniformly assigning budgets across attention heads harms their post-eviction generation quality. In light of these findings, we propose a simple yet effective adaptive budget allocation algorithm. This algorithm not only optimizes the theoretical loss upper bound but also reduces the $L_1$ eviction loss in practice by aligning with the varied characteristics across different heads. By integrating this algorithm into two state-of-the-art methods, we demonstrate the effectiveness of using adaptive budget allocation to optimize KV cache eviction. Extensive evaluations on 16 datasets and the Needle-in-a-Haystack test confirm significant performance improvements across various tasks.
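As a rough illustration of the adaptive allocation idea described above (not the paper's exact algorithm), the sketch below pools per-head importance scores, e.g. accumulated attention weights, and lets the globally largest entries determine how much of the total budget each head receives. Function names, the scoring criterion, and the absence of any per-head safeguard are assumptions made for brevity.

```python
import numpy as np

def adaptive_budget_allocation(head_scores, total_budget):
    """Split a total KV cache budget across attention heads.

    head_scores: list of 1-D arrays, one per head, holding an importance
    score (e.g. an accumulated attention weight) for every cached token.
    Returns the number of cache entries each head may keep.

    Sketch only: scores are pooled and the globally largest `total_budget`
    entries decide each head's share, so the split adapts to each head's
    score distribution rather than being uniform.
    """
    pooled = np.concatenate(head_scores)
    head_ids = np.concatenate(
        [np.full(len(s), h) for h, s in enumerate(head_scores)]
    )
    top = np.argpartition(pooled, -total_budget)[-total_budget:]
    return np.bincount(head_ids[top], minlength=len(head_scores)).tolist()

def evict_to_budget(scores, budget):
    """Indices of the `budget` highest-scoring entries for one head."""
    if budget == 0:
        return np.array([], dtype=int)
    return np.sort(np.argpartition(scores, -budget)[-budget:])

# Example: two heads with eight cached tokens each and a total budget of six.
rng = np.random.default_rng(0)
scores = [rng.random(8), rng.random(8)]
budgets = adaptive_budget_allocation(scores, total_budget=6)
kept = [evict_to_budget(s, b) for s, b in zip(scores, budgets)]
print(budgets, kept)
```

In this toy version, heads with many globally competitive scores keep more entries and heads whose scores rarely reach the global top keep fewer, in contrast to a uniform split.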
Related papers
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.
We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- In-context KV-Cache Eviction for LLMs via Attention-Gate [12.732519329131392]
The KV-Cache technique has become the standard for the inference of large language models (LLMs).
This paper devises a parameterized KV-Cache eviction mechanism, dubbed Attention-Gate.
Attention-Gate accepts the whole context as input and yields eviction flags for each token to realize in-context eviction.
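A minimal sketch of this in-context gating idea follows, assuming NumPy and a plain logistic gate over token hidden states as a stand-in for the learned Attention-Gate module; all names and shapes are illustrative.

```python
import numpy as np

def attention_gate(hidden_states, gate_w, gate_b, keep_threshold=0.5):
    """Toy in-context eviction gate (hypothetical parameterization).

    hidden_states: (seq_len, d_model) token representations of the full context.
    gate_w, gate_b: stand-ins for the learned gate parameters; the real
    Attention-Gate module described in the paper is more elaborate.
    Returns a boolean flag per token: True = keep its KV entry, False = evict.
    """
    logits = hidden_states @ gate_w + gate_b         # one score per token
    keep_prob = 1.0 / (1.0 + np.exp(-logits))        # squash to (0, 1)
    return keep_prob >= keep_threshold

# Usage with random placeholders for a 10-token context.
rng = np.random.default_rng(1)
h = rng.normal(size=(10, 64))
flags = attention_gate(h, gate_w=0.1 * rng.normal(size=64), gate_b=0.0)
print(flags)  # per-token keep/evict flags computed from the whole context
```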
arXiv Detail & Related papers (2024-10-15T05:01:19Z)
- NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time [44.89402186438295]
Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows.
However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling.
We propose NACL, a general framework for long-context KV cache eviction that achieves more effective and efficient eviction in a single operation during the encoding phase.
arXiv Detail & Related papers (2024-08-07T10:31:07Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
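The channel-pruning idea can be sketched as follows; the query-weighted magnitude score and the `keep_ratio` parameter are assumptions for illustration and may not match ThinK's actual criterion.

```python
import numpy as np

def prune_key_channels(keys, queries, keep_ratio=0.6):
    """Drop the least significant channels of one head's cached keys.

    keys:    (num_tokens, head_dim) cached key states.
    queries: (num_queries, head_dim) recent query states used to score channels.
    Channels are ranked by a simple query-weighted magnitude score (an
    assumption for illustration; ThinK's exact criterion may differ) and
    only the top `keep_ratio` fraction is retained.
    Returns the pruned keys and the indices of the surviving channels.
    """
    channel_score = np.abs(queries).mean(axis=0) * np.abs(keys).mean(axis=0)
    k = max(1, int(keys.shape[1] * keep_ratio))
    kept = np.sort(np.argsort(channel_score)[-k:])
    return keys[:, kept], kept

rng = np.random.default_rng(2)
K = rng.normal(size=(128, 64))   # 128 cached tokens, 64 channels
Q = rng.normal(size=(4, 64))     # a few recent queries
K_pruned, kept = prune_key_channels(K, Q)
print(K_pruned.shape, kept[:5])  # pruned shape plus the first surviving channel ids
```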
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
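A simplified sketch of similarity-based KV merging is shown below, assuming NumPy; the greedy grouping rule, cosine threshold, and averaging merge are placeholders rather than KVMerger's actual procedure.

```python
import numpy as np

def merge_similar_kv(keys, values, sim_threshold=0.95):
    """Greedy merge of adjacent KV pairs whose key states are highly similar.

    Whenever a key is close (cosine similarity above `sim_threshold`) to the
    running mean key of the current group, its KV pair is folded into that
    group by averaging; KVMerger's actual grouping and merge rule may differ.
    """
    merged_k, merged_v, counts = [keys[0].copy()], [values[0].copy()], [1]
    for k, v in zip(keys[1:], values[1:]):
        rep = merged_k[-1] / counts[-1]  # running mean key of the open group
        sim = np.dot(rep, k) / (np.linalg.norm(rep) * np.linalg.norm(k) + 1e-8)
        if sim >= sim_threshold:
            merged_k[-1] += k
            merged_v[-1] += v
            counts[-1] += 1
        else:
            merged_k.append(k.copy())
            merged_v.append(v.copy())
            counts.append(1)
    counts = np.array(counts)[:, None]
    return np.array(merged_k) / counts, np.array(merged_v) / counts

# Six near-duplicate keys plus two unrelated ones collapse to a shorter cache.
rng = np.random.default_rng(3)
base = rng.normal(size=64)
keys = np.stack(
    [base + 0.01 * rng.normal(size=64) for _ in range(6)]
    + [rng.normal(size=64) for _ in range(2)]
)
values = rng.normal(size=(8, 64))
mk, mv = merge_similar_kv(keys, values)
print(keys.shape, "->", mk.shape)
```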
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models [14.665924387149014]
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching.
Traditional KV cache eviction strategies discard less critical KV-pairs based on attention scores, which can lead to issues such as context loss or hallucinations.
We introduce Dynamic Discriminative Operations (D2O), a novel method that utilizes two-level discriminative strategies to optimize KV cache size without fine-tuning.
arXiv Detail & Related papers (2024-06-18T20:01:51Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network(NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics, based on minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- Large-scale Optimization of Partial AUC in a Range of False Positive Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on a recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z)
- Stochastic Hard Thresholding Algorithms for AUC Maximization [49.00683387735522]
We develop a hard thresholding algorithm for AUC maximization in distributed classification.
We conduct experiments to show the efficiency and effectiveness of the proposed algorithms.
arXiv Detail & Related papers (2020-11-04T16:49:29Z)