Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
- URL: http://arxiv.org/abs/2410.01805v1
- Date: Wed, 2 Oct 2024 17:59:52 GMT
- Title: Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads
- Authors: Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu
- Abstract summary: Locret is a framework for long-context LLM inference on a single Nvidia 4090 GPU.
During inference, we evict low-importance cache units in conjunction with a chunked prefill pattern, significantly reducing peak GPU memory usage.
To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU.
- Score: 30.690302709678758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling generation inference to such long contexts incurs significant additional computation and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static-sized cache methods, such as eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, a framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units in conjunction with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret; the results show that Locret outperforms recent competitive approaches, including InfLLM, quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated content. Locret achieves KV cache compression ratios of over 20x and 8x relative to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct, respectively. Additionally, Locret can be combined with other methods, such as quantization and token merging. To our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality and requiring little additional system optimization.
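The eviction mechanism described in the abstract can be illustrated with a short sketch. The snippet below is a hedged approximation, not the released Locret code: the retaining head is stubbed out as a random linear probe (`w_retain` is hypothetical), and plain hidden states stand in for KV pairs, but the chunked prefill loop and fixed-budget eviction follow the procedure the abstract describes.

```python
# Illustrative sketch of chunked prefill with importance-based KV eviction
# under a fixed cache budget. The "retaining head" is a stand-in linear probe.
import numpy as np

rng = np.random.default_rng(0)
d_model, budget, chunk_len = 64, 128, 32
seq = rng.normal(size=(12 * chunk_len, d_model))        # long prompt of hidden states

# Hypothetical retaining head: maps each cache unit to a scalar importance score.
w_retain = rng.normal(size=(d_model,)) / np.sqrt(d_model)

cache_states = np.empty((0, d_model))                   # retained units (stand-in for K/V)
cache_scores = np.empty((0,))

for start in range(0, len(seq), chunk_len):
    chunk = seq[start:start + chunk_len]
    scores = chunk @ w_retain                           # importance per new cache unit
    cache_states = np.concatenate([cache_states, chunk])
    cache_scores = np.concatenate([cache_scores, scores])
    if len(cache_states) > budget:                      # evict lowest-importance units
        keep = np.argsort(cache_scores)[-budget:]
        keep.sort()                                     # preserve positional order
        cache_states, cache_scores = cache_states[keep], cache_scores[keep]

print(cache_states.shape)   # (128, 64): the retained cache never exceeds the budget after each chunk
```

In the actual framework, the scorer would be a trained retaining head attached to the frozen backbone, and eviction would operate on real key-value tensors rather than hidden states.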
Related papers
- DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV.
It features an attention-based metric that signals when the remaining KV cache is unlikely to match full-cache performance, at which point the pruning process is halted.
Our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
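A rough sketch of a dynamic-budget stopping rule of this kind is given below; the attention-mass criterion and the threshold value are simple stand-ins, not DBudgetKV's actual metric.

```python
# Dynamic-budget pruning sketch: drop tokens in order of increasing attention
# mass until retaining less would risk falling short of full-cache quality.
import numpy as np

rng = np.random.default_rng(1)
attn = rng.dirichlet(np.ones(256))        # per-token attention mass, sums to 1
threshold = 0.95                          # assumed stopping criterion

order = np.argsort(attn)                  # least important tokens first
retained_mass, n_pruned = 1.0, 0
for idx in order:
    if retained_mass - attn[idx] < threshold:
        break                             # signal: stop pruning here
    retained_mass -= attn[idx]
    n_pruned += 1

print(f"pruned {n_pruned}/256 tokens, retained attention mass {retained_mass:.3f}")
```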
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs [74.74225314708225]
Multi-head Latent Attention (MLA) is an innovative architecture designed to ensure efficient and economical inference.
This paper proposes the first data-efficient fine-tuning method for transitioning from Multi-Head Attention to MLA.
arXiv Detail & Related papers (2025-02-20T18:50:42Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.
In these scenarios, the key-value (KV) cache is the primary bottleneck in terms of both GPU memory and latency.
We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
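The quantized-KV ingredient can be sketched as follows. This is a generic 4-bit group quantizer for illustration only; QuantSpec's hierarchical scheme, its weight quantization, and the self-speculative decoding loop are not reproduced, and the group size is an assumption.

```python
# 4-bit asymmetric group quantization of a KV cache tensor (illustrative only).
import numpy as np

def quant4(x, group=32):
    """Quantize groups of `group` values along the last axis to 4-bit levels."""
    g = x.reshape(*x.shape[:-1], -1, group)
    lo, hi = g.min(-1, keepdims=True), g.max(-1, keepdims=True)
    scale = (hi - lo) / 15.0 + 1e-8
    q = np.clip(np.round((g - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequant4(q, scale, lo, shape):
    return (q * scale + lo).reshape(shape)

kv = np.random.default_rng(2).normal(size=(8, 1024, 128))   # (heads, tokens, head_dim)
q, s, z = quant4(kv)
err = np.abs(dequant4(q, s, z, kv.shape) - kv).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```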
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference [5.1206021159434805]
MPCache is built on the observation that historical tokens in a long sequence may have different effects on the downstream decoding.
MPCache consistently outperforms prior-art KV cache eviction baselines across different LLM generation tasks.
arXiv Detail & Related papers (2025-01-12T13:18:04Z)
- ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference [25.638980944695728]
ShadowKV is an efficient inference system for long-context large language models (LLMs).
It stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences.
It can support up to 6× larger batch sizes and boost throughput by up to 3.04× on an A100 GPU.
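The two memory-saving ideas in this summary can be roughly illustrated as below; the rank, the SVD-based factorization, and the synthetic low-rank key data are demonstration assumptions, not ShadowKV's actual implementation.

```python
# Keep only a low-rank factorization of the key cache "on GPU" and leave the
# value cache in offloaded (here: host) memory.
import numpy as np

rng = np.random.default_rng(3)
base = rng.normal(size=(4096, 16)) @ rng.normal(size=(16, 128))
keys = base + 0.05 * rng.normal(size=(4096, 128))   # key cache with low-rank structure
values = rng.normal(size=(4096, 128))               # value cache, kept off-device

rank = 16
U, S, Vt = np.linalg.svd(keys, full_matrices=False)
K_low = (U[:, :rank] * S[:rank], Vt[:rank])         # thin key factors kept resident

approx = K_low[0] @ K_low[1]
rel_err = np.linalg.norm(approx - keys) / np.linalg.norm(keys)
stored = K_low[0].size + K_low[1].size
print(f"key-cache floats stored: {stored} vs {keys.size}, rel. error {rel_err:.3f}")
```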
arXiv Detail & Related papers (2024-10-28T19:08:12Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The key-value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
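A minimal sketch of such a plug-in low-rank factorization is shown below; the rank and the per-matrix SVD truncation are illustrative assumptions rather than LoRC's progressive, layer-sensitive strategy.

```python
# Replace a KV projection matrix W with two thin factors obtained by SVD.
import numpy as np

rng = np.random.default_rng(4)
d_model, d_head, rank = 1024, 128, 64
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)   # pretrained projection

U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]            # (d_model, rank)
B = Vt[:rank]                         # (rank, d_head)

x = rng.normal(size=(16, d_model))    # a batch of hidden states
k_full = x @ W_k
k_low = (x @ A) @ B                   # cache x @ A (rank dims) and apply B on the fly
# real projection weights are closer to low-rank than this random example
print(np.abs(k_full - k_low).mean())
```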
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [10.115950753431528]
Large Language Models (LLMs) are a significant milestone in generative AI.
The increasing context length and batch size in offline LLM inference escalates the memory requirement of the key-value (KV) cache.
Several cost-effective solutions leverage host memory or storage devices to reduce storage costs in offline inference scenarios.
We propose InstInfer, which offloads the most performance-critical computation (i.e., attention in decoding phase) and data (i.e., KV cache) parts to Computational Storage Drives (CSDs)
InstInfer improves throughput for long-sequence inference.
arXiv Detail & Related papers (2024-09-08T06:06:44Z)
- NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time [44.89402186438295]
Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows.
However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling.
We propose NACL, a general framework for long-context KV cache eviction that achieves more effective and efficient eviction in a single operation during the encoding phase.
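A generic single-pass eviction of this flavor might look like the sketch below; the scoring proxy, the keep ratio, and the random component are assumptions, not NACL's actual policy.

```python
# After the encoding phase, score every cached token once, keep the
# top-scoring ones plus a small random sample, and drop the rest in one pass.
import numpy as np

rng = np.random.default_rng(5)
n_tokens, budget = 2048, 256
scores = rng.random(n_tokens)                     # stand-in for accumulated attention

n_top = int(0.8 * budget)
top = np.argsort(scores)[-n_top:]
rest = np.setdiff1d(np.arange(n_tokens), top)
randomly_kept = rng.choice(rest, size=budget - n_top, replace=False)

keep = np.sort(np.concatenate([top, randomly_kept]))
print(len(keep), "of", n_tokens, "tokens retained after one eviction pass")
```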
arXiv Detail & Related papers (2024-08-07T10:31:07Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
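The channel-pruning idea can be sketched as follows; the per-channel score used here (|q| times the channel norm of the key cache) is an illustrative proxy, not ThinK's exact criterion.

```python
# Drop the key-cache channels that contribute least to the attention logits q @ K^T.
import numpy as np

rng = np.random.default_rng(6)
q = rng.normal(size=(128,))                  # current query, head_dim = 128
K = rng.normal(size=(4096, 128))             # key cache (tokens, head_dim)

keep_ratio = 0.5
channel_score = np.abs(q) * np.linalg.norm(K, axis=0)        # per-channel contribution proxy
keep = np.sort(np.argsort(channel_score)[-int(keep_ratio * 128):])

logits_full = K @ q
logits_pruned = K[:, keep] @ q[keep]          # only half the key channels are stored
print(np.corrcoef(logits_full, logits_pruned)[0, 1])   # correlation of full vs pruned logits
```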
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
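A toy version of similarity-based merging is sketched below; the greedy consecutive grouping, the cosine threshold, and the plain averaging are simplifications of KVMerger's merge-set identification and weighted merging.

```python
# Collapse consecutive tokens whose key states are highly similar into one averaged entry.
import numpy as np

rng = np.random.default_rng(7)
K = rng.normal(size=(512, 64))
K[1::2] = K[::2] + 0.01 * rng.normal(size=(256, 64))    # make neighbours similar

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

merged, group = [], [K[0]]
for k in K[1:]:
    if cos(group[-1], k) > 0.9:          # token joins the current merge set
        group.append(k)
    else:                                # close the set: average its key states
        merged.append(np.mean(group, axis=0))
        group = [k]
merged.append(np.mean(group, axis=0))
print(len(merged), "merged entries from", len(K), "tokens")
```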
arXiv Detail & Related papers (2024-07-11T12:50:42Z)
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [53.08975547824068]
We investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing.
Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers and is progressively consolidated in higher layers.
Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method.
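A back-of-the-envelope budget schedule in this spirit is shown below; the linear allocation rule is an assumption, not PyramidKV's formula.

```python
# Lower layers, where attention is spread widely, keep more KV entries than higher layers.
import numpy as np

n_layers, total_budget = 32, 32 * 512            # average of 512 cached tokens per layer
weights = np.linspace(2.0, 0.5, n_layers)        # linearly shrinking share per layer
budgets = np.round(total_budget * weights / weights.sum()).astype(int)

print(budgets[:4], "...", budgets[-4:])          # large budgets low, small budgets high
print("total:", budgets.sum(), "target:", total_budget)
```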
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models [48.03117580340151]
The key-value (KV) cache stores the key-value states of previously generated tokens.
The size of the KV cache grows linearly with sequence length, posing challenges for applications requiring long context input and extensive sequence generation.
We present a simple yet effective approach, called MiniCache, to compress the KV cache across layers from a novel depth perspective.
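A simplified depth-wise merge might look like the sketch below; the shared-direction interpolation is an assumption, and MiniCache's handling of outlier tokens and its calibration are omitted.

```python
# Adjacent layers share one merged key direction per token, with per-layer
# magnitudes kept so each layer can rescale it.
import numpy as np

rng = np.random.default_rng(8)
K_l = rng.normal(size=(1024, 64))
K_l1 = K_l + 0.1 * rng.normal(size=(1024, 64))       # adjacent layers are highly similar

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

shared_dir = unit(unit(K_l) + unit(K_l1))            # one direction per token for both layers
mags = np.linalg.norm(K_l, axis=-1), np.linalg.norm(K_l1, axis=-1)

K_l_restored = shared_dir * mags[0][:, None]         # per-layer reconstruction
print(np.abs(K_l_restored - K_l).mean())             # approximation error of the merge
```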
arXiv Detail & Related papers (2024-05-23T09:43:52Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory.
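The asymmetric layout (per-channel keys, per-token values) can be sketched as below; KIVI's group sizes, full-precision residual window, and bit packing are omitted, and only a rounded dequantized view is shown for illustration.

```python
# Quantize the key cache per channel and the value cache per token to 2 bits.
import numpy as np

def quant2(x, axis):
    lo = x.min(axis=axis, keepdims=True)
    scale = (x.max(axis=axis, keepdims=True) - lo) / 3.0 + 1e-8   # 2 bits -> levels 0..3
    q = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return q * scale + lo                                          # dequantized view

rng = np.random.default_rng(9)
K = rng.normal(size=(4096, 128))          # (tokens, channels)
V = rng.normal(size=(4096, 128))

K_hat = quant2(K, axis=0)                 # per-channel: reduce along the token axis
V_hat = quant2(V, axis=1)                 # per-token: reduce along the channel axis
print(np.abs(K_hat - K).mean(), np.abs(V_hat - V).mean())
```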
arXiv Detail & Related papers (2024-02-05T06:06:47Z)