On the Efficacy of Eviction Policy for Key-Value Constrained Generative
Language Model Inference
- URL: http://arxiv.org/abs/2402.06262v2
- Date: Sat, 17 Feb 2024 10:08:14 GMT
- Title: On the Efficacy of Eviction Policy for Key-Value Constrained Generative
Language Model Inference
- Authors: Siyu Ren, Kenny Q. Zhu
- Abstract summary: Large Language Models (LLMs) are notably cost-prohibitive to deploy in resource-constrained environments.
We introduce RoCo, a robust cache omission policy based on temporal attention scores and robustness measures.
We release EasyKV, a versatile software package dedicated to user-friendly key-value constrained generative inference.
- Score: 40.789027180025286
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the recent success of Large Language Models (LLMs), they
are notably cost-prohibitive to deploy in resource-constrained environments due
to their excessive memory and computational demands. In addition to model
parameters, the key-value cache is also stored in GPU memory, growing linearly
with batch size and sequence length. As a remedy, recent works have proposed
various eviction policies for keeping the key-value cache overhead within a
given budget. This paper examines the efficacy of existing eviction policies
along two dimensions: importance score calculation and eviction scope
construction. We identify the deficiencies of prior policies in both aspects
and introduce RoCo, a robust cache omission policy based on temporal attention
scores and robustness measures. Extensive experimentation spanning prefilling
and auto-regressive decoding stages validates the superiority of RoCo. Finally,
we release EasyKV, a versatile software package dedicated to user-friendly
key-value constrained generative inference. Code available at
https://github.com/DRSY/EasyKV.
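To make the setting concrete, below is a minimal sketch of budget-constrained KV cache eviction driven by accumulated attention scores. It is illustrative only: the function and tensor names are hypothetical, and it does not reproduce RoCo's temporal scoring or robustness measures (see the EasyKV repository for the actual implementation).

```python
import torch

def evict_kv_under_budget(keys, values, accum_attn, budget, recent_window=8):
    """Keep only the `budget` most important KV entries for one head.

    keys, values: [seq_len, head_dim] cached tensors
    accum_attn:   [seq_len] attention mass each cached token has received
                  over past decoding steps (a simple importance score)
    recent_window: the newest tokens are always kept, since their scores
                   rest on few observations
    """
    seq_len = keys.size(0)
    if seq_len <= budget:
        return keys, values, accum_attn
    scores = accum_attn.clone()
    scores[-recent_window:] = float("inf")   # protect the recent window
    keep = torch.topk(scores, k=budget).indices.sort().values  # keep order
    return keys[keep], values[keep], accum_attn[keep]

# Toy usage: squeeze a 16-entry cache down to a budget of 8.
k, v = torch.randn(16, 64), torch.randn(16, 64)
mass = torch.rand(16)
k, v, mass = evict_kv_under_budget(k, v, mass, budget=8)
print(k.shape)  # torch.Size([8, 64])
```

Protecting a recent window is a common safeguard in this family of policies, since newly cached tokens have received too little attention for their scores to be reliable.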
Related papers
- Learning to Evict from Key-Value Cache [17.365511268829703]
We introduce KV Policy, a framework for learning to rank tokens by their predicted usefulness for future decoding.
It is evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k.
Results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
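As a loose illustration of the learning-to-rank idea (hypothetical architecture, not the paper's), a small network can map per-token cache features to a predicted future-usefulness score, and the lowest-ranked tokens are evicted first:

```python
import torch
import torch.nn as nn

class TokenUtilityRanker(nn.Module):
    """Hypothetical learned eviction ranker: an MLP maps per-token cache
    features (e.g., past attention mass, recency, layer statistics) to a
    predicted future-usefulness score."""

    def __init__(self, n_features=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, feats):                 # feats: [seq_len, n_features]
        return self.net(feats).squeeze(-1)    # [seq_len] usefulness scores

ranker = TokenUtilityRanker()
scores = ranker(torch.randn(128, 4))
evict_first = torch.topk(-scores, k=32).indices  # 32 least useful tokens
```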
arXiv Detail & Related papers (2026-02-10T19:34:15Z)
- FASA: Frequency-aware Sparse Attention [56.26881872333624]
We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance.
Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head.
Across a spectrum of long-context tasks, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy.
arXiv Detail & Related papers (2026-02-03T06:09:06Z)
- Taming the Fragility of KV Cache Eviction in LLM Inference [36.547639886708026]
We propose a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead.
Our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size.
arXiv Detail & Related papers (2025-10-15T09:18:58Z)
- Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize a key-value (KV) cache to store historical information during sequence processing.
Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction.
We propose Judge Q, a novel training method which incorporates a soft token list.
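The last-window convention that Judge Q improves upon can be sketched in a few lines; the names below are hypothetical, and causal masking is omitted for brevity:

```python
import torch

def last_window_importance(queries, keys, window=32):
    """Score each cached key by the attention mass it receives from the
    last `window` prompt queries.

    queries, keys: [seq_len, head_dim] for one attention head
    returns:       [seq_len] importance scores used to rank KV entries
    """
    q_win = queries[-window:]                          # [window, d]
    logits = q_win @ keys.T / keys.size(-1) ** 0.5     # scaled dot products
    attn = torch.softmax(logits, dim=-1)               # [window, seq_len]
    return attn.sum(dim=0)                             # mass per cached key
```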
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
- Lag-Relative Sparse Attention In Long Context Training [8.365610885641276]
We propose Lag-Relative Sparse Attention (LRSA), anchored by the LagKV compression method, for long context post-training.
Our method performs chunk-by-chunk prefilling, which selects the top K most relevant key-value pairs in a fixed-size lagging window.
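A minimal sketch of one such compression step, assuming the lagging window is scored by attention from the newest chunk's queries (hypothetical names, not the authors' implementation):

```python
import torch

def compress_lagging_window(q_chunk, lag_keys, lag_values, k_keep=64):
    """During chunk-by-chunk prefilling, retain only the top-k KV pairs in
    the fixed-size lagging window, ranked by attention from the newest
    chunk's queries.

    q_chunk:              [chunk_len, d] queries of the chunk just processed
    lag_keys, lag_values: [lag_len, d] KV pairs in the lagging window
    """
    logits = q_chunk @ lag_keys.T / lag_keys.size(-1) ** 0.5
    mass = torch.softmax(logits, dim=-1).sum(dim=0)      # [lag_len]
    k = min(k_keep, mass.numel())
    keep = torch.topk(mass, k=k).indices.sort().values   # preserve order
    return lag_keys[keep], lag_values[keep]
```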
arXiv Detail & Related papers (2025-06-13T06:49:53Z)
- CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation [6.544043376474944]
Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs.
Existing approaches partially alleviate these issues but often require additional training or suffer from performance degradation with longer inputs.
We introduce CacheFocus, a method that enhances length normalization and reduces inference latency without any further training.
arXiv Detail & Related papers (2025-02-16T12:33:16Z)
- CSR: Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR).
CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference.
Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
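As a rough illustration of the dense-to-sparse idea, a one-shot top-k projection over a fixed dictionary yields (index, weight) pairs per vector; CSR's actual construction may differ, and all names here are hypothetical:

```python
import torch

def sparsify(x, dictionary, n_atoms=8):
    """Represent a dense cache vector as (index, weight) pairs over a fixed
    dictionary. Reconstruction is only approximate unless the selected
    dictionary rows are orthonormal.

    x:          [d] one dense key or value vector
    dictionary: [n_entries, d] with (ideally) unit-norm rows
    """
    corr = dictionary @ x                          # correlation with atoms
    idx = torch.topk(corr.abs(), k=n_atoms).indices
    return idx, corr[idx]                          # store these, drop x

def densify(idx, weights, dictionary):
    """Rebuild an approximation of the original vector on demand."""
    return weights @ dictionary[idx]               # [d]

D = torch.nn.functional.normalize(torch.randn(1024, 64), dim=-1)
idx, w = sparsify(torch.randn(64), D)   # 8 ints + 8 floats replace 64 floats
approx = densify(idx, w, D)
```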
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
- Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of the model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z)
- NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time [44.89402186438295]
Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows.
However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling.
We propose NACL, a general framework for long-context KV cache eviction that achieves more optimal and efficient eviction in a single operation during the encoding phase.
arXiv Detail & Related papers (2024-08-07T10:31:07Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of cache entries.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
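A toy sketch of the merging idea, folding each entry that falls outside the budget into its nearest kept neighbor instead of discarding it (hypothetical names, not Elastic Cache's exact algorithm):

```python
import torch

def merge_under_budget(keys, values, importance, budget):
    """Instead of discarding low-importance KV entries outright, fold each
    evicted entry into the nearest kept entry that precedes it.

    keys, values: [seq_len, d]; importance: [seq_len]
    """
    seq_len = keys.size(0)
    if seq_len <= budget:
        return keys, values
    keep = torch.topk(importance, k=budget).indices.sort().values
    new_k, new_v = keys[keep].clone(), values[keep].clone()
    kept = set(keep.tolist())
    for i in range(seq_len):
        if i in kept:
            continue
        # index of the last kept position at or before i
        j = int(torch.searchsorted(keep, torch.tensor(i), right=True)) - 1
        j = max(j, 0)                    # entries before the first kept slot
        new_k[j] = 0.5 * (new_k[j] + keys[i])   # fold in by averaging
        new_v[j] = 0.5 * (new_v[j] + values[i])
    return new_k, new_v
```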
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [19.447729423696096]
Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency.
Recent efforts try to reduce KV cache size to a given memory budget by evicting large numbers of non-critical cache elements at runtime.
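The core allocation idea can be sketched as proportional splitting of one global budget across heads, given some per-head need signal (hypothetical signature, not Ada-KV's exact scheme):

```python
import torch

def allocate_head_budgets(head_need, total_budget, floor=4):
    """Split a global KV budget across heads in proportion to a per-head
    need signal (e.g., attention entropy), with a minimum floor per head.

    head_need: [n_heads] non-negative scores, not all zero; returns integer
    budgets that sum exactly to total_budget.
    """
    n = head_need.numel()
    spare = total_budget - floor * n
    frac = head_need / head_need.sum()
    budgets = (frac * spare).floor().long() + floor
    leftover = total_budget - int(budgets.sum())   # lost to flooring, < n
    if leftover > 0:
        budgets[torch.topk(frac, k=leftover).indices] += 1
    return budgets

print(allocate_head_budgets(torch.tensor([1.0, 3.0, 2.0, 2.0]), 64))
# tensor([10, 22, 16, 16])
```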
arXiv Detail & Related papers (2024-07-16T09:53:32Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
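A toy sketch of the cascading-buffer idea, where an entry survives into the next (older) buffer only if its importance clears that buffer's median (hypothetical structure, not the paper's implementation):

```python
from collections import deque

class CascadingCache:
    """Chain of fixed-size sub-cache buffers: on overflow, the oldest entry
    cascades into the next buffer only if its importance score is at or
    above that buffer's median, so retention grows more selective with
    token age."""

    def __init__(self, n_buffers=3, capacity=4):
        self.buffers = [deque() for _ in range(n_buffers)]
        self.capacity = capacity

    def add(self, token_id, score, level=0):
        if level >= len(self.buffers):
            return                          # dropped past the last buffer
        buf = self.buffers[level]
        buf.append((token_id, score))
        if len(buf) > self.capacity:
            old_id, old_score = buf.popleft()
            median = sorted(s for _, s in buf)[len(buf) // 2]
            if old_score >= median:
                self.add(old_id, old_score, level + 1)

cache = CascadingCache()
for t in range(12):
    cache.add(t, score=float(t % 5))
print([list(b) for b in cache.buffers])
```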
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that accounts for both parameter uncertainty and noisy-return uncertainty.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.