NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
- URL: http://arxiv.org/abs/2408.03675v2
- Date: Thu, 8 Aug 2024 01:20:13 GMT
- Title: NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
- Authors: Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
- Abstract summary: Large Language Models (LLMs) have ignited a surge of innovative AI applications, marking a new era of possibilities enabled by extended context windows.
However, hosting these models is cost-prohibitive, mainly due to the extensive memory the KV Cache consumes in long-context modeling.
We propose NACL, a general framework for long-context KV cache eviction that achieves more effective and efficient eviction in a single operation during the encoding phase.
- Score: 44.89402186438295
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have ignited a surge of innovative AI applications, marking a new era of possibilities enabled by extended context windows. However, hosting these models is cost-prohibitive, mainly due to the extensive memory the KV Cache consumes in long-context modeling. Although several works propose evicting unnecessary tokens from the KV Cache, most rely on biased local statistics of accumulated attention scores and report performance using unconvincing metrics such as perplexity on inadequate short-text evaluations. In this paper, we propose NACL, a general framework for long-context KV cache eviction that achieves more effective and efficient eviction in a single operation during the encoding phase. Because NACL is efficient, we can combine the more accurate attention-score statistics of PROXY TOKENS EVICTION with the diversified random eviction strategy of RANDOM EVICTION, alleviating attention bias and making the retention of pivotal tokens more robust for long-context modeling tasks. Notably, our method significantly improves performance on short- and long-text tasks by 80% and 76% respectively, reducing the KV Cache by up to 50% while maintaining over 95% of full-cache performance. The code is available at https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2024-NACL.
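To make the two-part recipe concrete, here is a minimal numpy sketch of one-shot eviction during prefill, combining proxy-token attention statistics with randomized retention. The function name, the choice of proxy rows, and the 70/30 split are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nacl_style_eviction(attn, keep, proxy_idx, rng, proxy_frac=0.7):
    """Toy one-shot KV eviction over a prefill attention matrix.

    attn:      (num_queries, seq_len) attention weights
    keep:      total number of KV entries to retain
    proxy_idx: query rows treated as "proxy tokens", assumed to give
               less biased scores than averaging over all queries
    """
    seq_len = attn.shape[1]
    # Score each cached token by attention mass from proxy queries only.
    scores = attn[proxy_idx].mean(axis=0)
    n_proxy = int(keep * proxy_frac)          # entries kept by score
    n_rand = keep - n_proxy                   # entries kept at random
    by_score = np.argsort(scores)[::-1][:n_proxy]
    # Randomly retain some of the remaining tokens for robustness.
    rest = np.setdiff1d(np.arange(seq_len), by_score)
    by_rand = rng.choice(rest, size=n_rand, replace=False)
    return np.sort(np.concatenate([by_score, by_rand]))

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(128), size=16)   # 16 queries over 128 tokens
kept = nacl_style_eviction(attn, keep=64, proxy_idx=[12, 13, 14, 15], rng=rng)
print(kept.shape)  # (64,) indices of retained KV entries
```

The random component is what distinguishes this from purely score-greedy policies such as H2O-style eviction: it hedges against the proxy scores themselves being biased.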
Related papers
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, Activation-aware approach that dynamically determines the probe-Query and leverages it to retrieve the relevant KV pairs for inference.
Experiments on the Long-Bench and $\infty$-Bench benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
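As a rough illustration of what an activation-aware probe-query might look like, the following numpy sketch weights query channels by their activation magnitude before scoring keys; the weighting rule is our reading of the one-line summary, not ActQKV's actual construction.

```python
import numpy as np

def activation_aware_retrieval(queries, keys, k):
    """Toy probe-query KV retrieval: pool recent queries, weight the
    probe's channels by how strongly they are activated, then take
    the top-k keys by score. The weighting rule is an assumption."""
    probe = queries.mean(axis=0)                 # pool recent queries
    weight = np.abs(probe) / (np.abs(probe).sum() + 1e-9)
    scores = keys @ (probe * weight)             # activation-weighted match
    return np.argsort(scores)[::-1][:k]          # indices of retrieved KV pairs

rng = np.random.default_rng(1)
idx = activation_aware_retrieval(rng.normal(size=(4, 64)),
                                 rng.normal(size=(512, 64)), k=32)
print(idx[:5])
```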
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- Cross-Self KV Cache Pruning for Efficient Vision-Language Inference [19.062950348441426]
KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation.
We propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities).
Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared to models with full KV caches.
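A minimal sketch of the intra-/inter-modality decomposition described above, assuming token modality ids are known and query modalities align with the last rows of the attention matrix (an illustrative convention, not the paper's setup):

```python
import numpy as np

def modality_split_scores(attn, modality):
    """Split token importance into intra- and inter-modality attention.
    attn: (Q, T) weights; modality: (T,) ids, with query modality ids
    assumed to be those of the last Q tokens."""
    q_mod = modality[-attn.shape[0]:]                    # modality of each query
    same = (q_mod[:, None] == modality[None, :])         # (Q, T) mask
    intra = np.where(same, attn, 0.0).sum(axis=0)
    inter = np.where(~same, attn, 0.0).sum(axis=0)
    return intra, inter

rng = np.random.default_rng(2)
attn = rng.dirichlet(np.ones(100), size=8)   # 8 text queries over 100 tokens
modality = np.array([0] * 60 + [1] * 40)     # e.g. 60 vision, 40 text tokens
intra, inter = modality_split_scores(attn, modality)
keep_vision = np.argsort(inter)[::-1][:16]   # vision tokens via cross-modal score
keep_text = np.argsort(intra)[::-1][:16]     # text tokens via same-modal score
print(len(keep_vision), len(keep_text))
```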
arXiv Detail & Related papers (2024-12-05T22:47:17Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
The key-value (KV) cache required by lengthy input and output sequences contributes notably to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves state-of-the-art performance compared with prior approaches.
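One plausible reading of a single global prefix configuration is sketched below: a bisection over a shared importance threshold that yields per-layer prefix sizes under a total budget. The search criterion is an assumption for illustration, not PrefixKV's algorithm.

```python
import numpy as np

def prefix_budget_search(layer_scores, budget):
    """Toy global prefix search: bisect one shared score threshold so
    that the number of retained entries, summed over layers, fits the
    budget; each layer then keeps its own highest-scoring prefix."""
    lo, hi = 0.0, max(s.max() for s in layer_scores)
    for _ in range(50):                       # bisection on the threshold
        mid = (lo + hi) / 2
        total = sum(int((s >= mid).sum()) for s in layer_scores)
        if total > budget:
            lo = mid                          # threshold too permissive
        else:
            hi = mid
    return [int((s >= hi).sum()) for s in layer_scores]  # per-layer sizes

rng = np.random.default_rng(3)
scores = [rng.random(256) for _ in range(4)]  # 4 layers, 256 tokens each
print(prefix_budget_search(scores, budget=400))
```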
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- In-context KV-Cache Eviction for LLMs via Attention-Gate [12.732519329131392]
The KV-Cache technique has become the standard for inference in large language models (LLMs).
This paper devises a parameterized KV-Cache eviction mechanism, dubbed Attention-Gate.
Attention-Gate accepts the whole context as input and yields eviction flags for each token to realize in-context eviction.
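A toy version of such a gate, with a single linear layer plus sigmoid standing in for the learned module (the real Attention-Gate is parameterized and trained; the weights here are random placeholders):

```python
import numpy as np

def attention_gate(hidden, w, b, keep_threshold=0.5):
    """Toy parameterized eviction gate: a per-token sigmoid over the
    hidden states yields keep/evict flags for the whole context."""
    logits = hidden @ w + b                   # (T,) one score per token
    keep_prob = 1.0 / (1.0 + np.exp(-logits))
    return keep_prob >= keep_threshold        # boolean keep flags

rng = np.random.default_rng(4)
hidden = rng.normal(size=(128, 32))           # 128 context tokens
flags = attention_gate(hidden, rng.normal(size=32), 0.0)
print(flags.sum(), "of", len(flags), "tokens kept")
```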
arXiv Detail & Related papers (2024-10-15T05:01:19Z)
- ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification [29.163757099307553]
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase.
We present ZipVL, an efficient inference framework designed for LVLMs that uses a dynamic ratio-allocation strategy for important tokens.
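A hedged sketch of dynamic ratio allocation: keep the smallest token set whose cumulative attention mass reaches a target, so the retained ratio adapts to how peaked the attention distribution is. This is our reading of the summary, not ZipVL's exact rule.

```python
import numpy as np

def dynamic_keep_ratio(scores, mass=0.95):
    """Toy dynamic allocation: keep the fewest tokens whose cumulative
    attention mass reaches `mass`; the kept ratio thus varies with
    how concentrated the attention is."""
    order = np.argsort(scores)[::-1]
    csum = np.cumsum(scores[order]) / scores.sum()
    k = int(np.searchsorted(csum, mass)) + 1
    return order[:k]

rng = np.random.default_rng(5)
peaked = rng.dirichlet(np.full(200, 0.1))     # concentrated attention
flat = rng.dirichlet(np.full(200, 10.0))      # diffuse attention
print(len(dynamic_keep_ratio(peaked)), len(dynamic_keep_ratio(flat)))
```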
arXiv Detail & Related papers (2024-10-11T07:24:21Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
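To illustrate channel-level (rather than token-level) pruning, here is a toy query-driven ranking of key channels; the contribution statistic is an assumption based on the abstract, not ThinK's derivation.

```python
import numpy as np

def think_style_channel_prune(queries, keys, keep_channels):
    """Toy query-driven key-channel pruning: rank key channels by the
    magnitude of their contribution to q.k products and zero the rest,
    so only the kept channels need to be stored."""
    contrib = np.abs(queries).mean(axis=0) * np.abs(keys).mean(axis=0)
    kept = np.argsort(contrib)[::-1][:keep_channels]
    pruned = np.zeros_like(keys)
    pruned[:, kept] = keys[:, kept]           # retain only kept channels
    return pruned, kept

rng = np.random.default_rng(6)
q, k = rng.normal(size=(8, 64)), rng.normal(size=(512, 64))
pk, kept = think_style_channel_prune(q, k, keep_channels=40)
err = np.abs(q @ k.T - q @ pk.T).mean()       # attention-logit error
print(len(kept), round(float(err), 3))
```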
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
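A greedy toy version of similarity-based merging, averaging runs of consecutive tokens whose keys are nearly parallel; the run-merging policy is an illustrative stand-in for KVMerger's actual rule.

```python
import numpy as np

def merge_similar_keys(keys, values, thresh=0.9):
    """Toy KV merging: average runs of consecutive tokens whose keys
    have cosine similarity above `thresh`, exploiting the observed
    token-level similarity of key states within a sequence."""
    merged_k, merged_v, run = [], [], [0]
    unit = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    for i in range(1, len(keys)):
        if unit[i] @ unit[i - 1] >= thresh:
            run.append(i)                     # extend the current run
        else:
            merged_k.append(keys[run].mean(axis=0))
            merged_v.append(values[run].mean(axis=0))
            run = [i]
    merged_k.append(keys[run].mean(axis=0))
    merged_v.append(values[run].mean(axis=0))
    return np.stack(merged_k), np.stack(merged_v)

rng = np.random.default_rng(7)
base = np.repeat(rng.normal(size=(10, 16)), 4, axis=0)   # similar neighbours
mk, mv = merge_similar_keys(base + 0.01 * rng.normal(size=base.shape), base)
print(base.shape[0], "->", mk.shape[0])
```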
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
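A minimal sketch of cascading sub-cache buffers, where tokens evicted from one fixed-size buffer are either promoted to the next or dropped; promotion by score threshold is an assumed policy, not the paper's exact mechanism.

```python
import random
from collections import deque

class CascadingCache:
    """Toy cascade of fixed-size sub-caches: new tokens enter the first
    buffer; on overflow, the oldest token is promoted to the next
    buffer if its score is high enough, otherwise dropped."""
    def __init__(self, sizes, promote_score=0.5):
        self.buffers = [deque() for _ in sizes]
        self.sizes = sizes
        self.promote_score = promote_score

    def add(self, token_id, score, level=0):
        if level >= len(self.buffers):
            return                                    # fell off the cascade
        buf = self.buffers[level]
        buf.append((token_id, score))
        if len(buf) > self.sizes[level]:
            old_id, old_score = buf.popleft()         # overflow: evict oldest
            if old_score >= self.promote_score:
                self.add(old_id, old_score, level + 1)

random.seed(8)
cache = CascadingCache(sizes=[4, 4, 4])
for t in range(32):
    cache.add(t, random.random())
print([len(b) for b in cache.buffers])
```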
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
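As a rough illustration of recency-driven retention, the sketch below keeps any key that received non-negligible attention from the most recent queries; the window size and threshold are illustrative, not CORM's settings.

```python
import numpy as np

def corm_style_keep(attn, recent=8, min_score=1e-3):
    """Toy recency-driven retention: a key is kept if any of the most
    recent `recent` queries gave it non-negligible attention."""
    recent_attn = attn[-recent:]                  # (recent, T) window
    return np.where(recent_attn.max(axis=0) >= min_score)[0]

rng = np.random.default_rng(9)
attn = rng.dirichlet(np.full(256, 0.05), size=32)  # peaked attention rows
kept = corm_style_keep(attn)
print(f"kept {len(kept)}/{attn.shape[1]} KV pairs")
```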
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.