A Survey on Large Language Model Acceleration based on KV Cache Management
- URL: http://arxiv.org/abs/2412.19442v2
- Date: Thu, 02 Jan 2025 03:40:15 GMT
- Title: A Survey on Large Language Model Acceleration based on KV Cache Management
- Authors: Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen
- Abstract summary: Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks. The computational and memory demands of LLMs pose significant challenges when scaling them to real-world, long-context, and real-time applications. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations.
- Score: 21.4802409745396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is in: \href{https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management}{https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management}.
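Before the per-paper summaries, a minimal sketch of the mechanism the survey is organized around may help: during autoregressive decoding, the keys and values of past tokens can be cached and reused, so each step projects only the newest token instead of recomputing projections for the entire prefix. The toy attention head below is illustrative only (random weights, a hypothetical head size), not code from the survey.

```python
# Minimal sketch of why a KV cache saves work during autoregressive decoding.
# Toy single-head attention with random "model" weights; not any specific LLM.
import numpy as np

d = 8                      # head dimension (toy size)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Without a cache, every step would recompute K and V for all past tokens.
# With a cache, each step projects only the newest token and appends it.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
for step in range(5):
    x_t = rng.standard_normal(d)                 # hidden state of the newest token
    K_cache = np.vstack([K_cache, x_t @ W_k])    # append one new key
    V_cache = np.vstack([V_cache, x_t @ W_v])    # append one new value
    out = attend(x_t @ W_q, K_cache, V_cache)    # only the new token was projected this step
print("decoded", K_cache.shape[0], "steps; cache holds", K_cache.shape, "keys")
```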
Related papers
- Quantifying Memory Utilization with Effective State-Size [73.52115209375343]
We develop a measure of memory utilization.
This metric is tailored to the fundamental class of systems with input-invariant and input-varying linear operators.
arXiv Detail & Related papers (2025-04-28T08:12:30Z) - Scaling Test-Time Inference with Policy-Optimized, Dynamic Retrieval-Augmented Generation via KV Caching and Decoding [0.0]
We present a framework for enhancing Retrieval-Augmented Generation (RAG) systems through dynamic retrieval strategies and reinforcement fine-tuning.
Our framework integrates two complementary techniques: Policy-Optimized Retrieval-Augmented Generation (PORAG) and Adaptive Token-Layer Attention Scoring (ATLAS).
Our framework reduces hallucinations, strengthens domain-specific reasoning, and achieves significant efficiency and scalability gains over traditional RAG systems.
arXiv Detail & Related papers (2025-04-02T01:16:10Z) - AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference [11.73134417321505]
We propose AirCache, a novel KV cache compression method aimed at accelerating LVLMs inference.
We show that our method achieves comparable performance to the full cache while retaining only 10% of the visual KV cache.
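The general retain-a-small-important-subset idea can be illustrated with attention-score-based selection; the scoring rule below is a hypothetical stand-in, not AirCache's actual inter-modal relevancy criterion.

```python
# Hedged sketch of importance-based KV cache eviction: keep only the 10% of
# cached positions that received the most attention. This illustrates the
# general idea of retaining a small, important subset of the cache.
import numpy as np

def compress_kv(keys, values, attn_weights, keep_ratio=0.10):
    """keys/values: (seq_len, d); attn_weights: (num_queries, seq_len)."""
    importance = attn_weights.sum(axis=0)               # total attention per position
    n_keep = max(1, int(len(importance) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])    # indices of the top positions
    return keys[keep], values[keep], keep

rng = np.random.default_rng(1)
K, V = rng.standard_normal((200, 64)), rng.standard_normal((200, 64))
A = rng.random((16, 200))
K_small, V_small, kept = compress_kv(K, V, A)
print(f"kept {len(kept)} of {K.shape[0]} positions")
```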
arXiv Detail & Related papers (2025-03-31T11:13:18Z) - KVShare: Semantic-Aware Key-Value Cache Sharing for Efficient Large Language Model Inference [7.894452711850396]
KVShare is a multi-user Key-Value (KV) Cache sharing technology based on semantic similarity.
It is designed to enhance the inference efficiency of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs).
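As a rough illustration of similarity-based cache sharing (the embedding function, threshold, and store layout here are illustrative assumptions, not KVShare's design): if a new prompt's embedding is close enough to a previously served prompt, its cached KV tensors can be reused instead of recomputing the prefill.

```python
# Hedged sketch of semantic KV cache sharing across requests.
import numpy as np

class SharedKVStore:
    def __init__(self, threshold=0.9):
        self.entries = []            # list of (embedding, kv_tensors)
        self.threshold = threshold

    def lookup(self, emb):
        for stored_emb, kv in self.entries:
            cos = emb @ stored_emb / (np.linalg.norm(emb) * np.linalg.norm(stored_emb))
            if cos >= self.threshold:
                return kv            # cache hit: reuse KV for the shared prefix
        return None

    def insert(self, emb, kv):
        self.entries.append((emb, kv))

store = SharedKVStore()
rng = np.random.default_rng(2)
emb_a = rng.standard_normal(128)
store.insert(emb_a, {"K": rng.standard_normal((50, 64)), "V": rng.standard_normal((50, 64))})
hit = store.lookup(emb_a + 0.01 * rng.standard_normal(128))   # near-duplicate prompt
print("cache hit" if hit is not None else "cache miss")
```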
arXiv Detail & Related papers (2025-03-17T16:43:35Z) - CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
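A generic sketch of storing a dense tensor as sparse indexes and weights follows; it uses simple per-row top-k sparsification for illustration and is not the actual CSR algorithm or its 1-bit codes.

```python
# Hedged sketch: represent a dense KV tensor by sparse indexes and weights,
# keeping only each row's k largest-magnitude entries.
import numpy as np

def to_sparse(dense, k=8):
    """dense: (rows, dim) -> (indexes, weights), each of shape (rows, k)."""
    idx = np.argsort(np.abs(dense), axis=1)[:, -k:]      # positions of the k largest entries
    w = np.take_along_axis(dense, idx, axis=1)
    return idx.astype(np.int32), w.astype(np.float16)

def to_dense(idx, w, dim):
    out = np.zeros((idx.shape[0], dim), dtype=np.float32)
    np.put_along_axis(out, idx, w.astype(np.float32), axis=1)
    return out

rng = np.random.default_rng(3)
K = rng.standard_normal((1024, 128)).astype(np.float32)
idx, w = to_sparse(K)
approx = to_dense(idx, w, K.shape[1])
sparse_bytes = idx.nbytes + w.nbytes
print(f"memory: {K.nbytes} -> {sparse_bytes} bytes ({sparse_bytes / K.nbytes:.1%} of dense)")
```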
arXiv Detail & Related papers (2024-12-16T13:01:53Z) - SCBench: A KV Cache-Centric Analysis of Long-Context Methods [61.025422435235456]
We introduce SCBench, a benchmark for evaluating long-context methods from a KV cache-centric perspective. We provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs and Mamba-Attention hybrids. Our findings show that sub-O(n) memory methods suffer in multi-turn scenarios, while sparse encoding with O(n) memory and sub-O(n^2) pre-filling performs robustly.
arXiv Detail & Related papers (2024-12-13T17:59:52Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. Our method achieves state-of-the-art performance compared with prior approaches.
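One toy way to picture per-layer prefix budgeting is below: under a global KV budget, each layer keeps only a prefix of its cache whose length follows a per-layer importance score. The allocation rule and scores are illustrative assumptions, not PrefixKV's actual global prefix search.

```python
# Hedged sketch of per-layer prefix budgeting under a global KV budget.
import numpy as np

def allocate_prefix_lengths(layer_scores, total_budget, seq_len):
    """Distribute a total KV budget across layers in proportion to their scores."""
    scores = np.asarray(layer_scores, dtype=np.float64)
    raw = scores / scores.sum() * total_budget
    return np.minimum(np.round(raw).astype(int), seq_len)   # per-layer prefix length

seq_len = 1000
layer_scores = [0.5, 1.0, 2.0, 0.5]                          # hypothetical importance per layer
budgets = allocate_prefix_lengths(layer_scores, total_budget=1200, seq_len=seq_len)
print("per-layer prefix lengths:", budgets.tolist())         # e.g. [150, 300, 600, 150]
```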
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - Efficient Large Foundation Model Inference: A Perspective From Model and System Co-Design [35.40505841618305]
As Large Language Models (LLMs) become popular, the need for efficient model design for LLMs grows. This paper focuses on modern efficient inference technologies for LLMs and illustrates them from two perspectives: model and system design.
arXiv Detail & Related papers (2024-09-03T15:35:01Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
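A rough illustration of query-driven key-channel pruning follows; the channel score used here is a hypothetical stand-in, not ThinK's exact criterion.

```python
# Hedged sketch: score each key-cache channel by how strongly recent queries
# use it, then drop the lowest-scoring channels.
import numpy as np

def prune_key_channels(K, Q, keep_ratio=0.5):
    """K: (seq_len, d) key cache; Q: (num_queries, d) recent queries."""
    score = np.abs(Q).mean(axis=0) * np.abs(K).mean(axis=0)   # per-channel importance (d,)
    n_keep = max(1, int(K.shape[1] * keep_ratio))
    channels = np.sort(np.argsort(score)[-n_keep:])           # keep the strongest channels
    return K[:, channels], channels

rng = np.random.default_rng(4)
K = rng.standard_normal((512, 128))
Q = rng.standard_normal((32, 128))
K_thin, kept = prune_key_channels(K, Q)
print(f"key cache channels: {K.shape[1]} -> {K_thin.shape[1]}")
```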
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption [66.97998742151918]
Large Language Models (LLMs) have revolutionized various industries with their advanced language comprehension.
However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts.
KV Cache has emerged as a pivotal solution, converting the time complexity of token generation from quadratic to linear.
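The flip side of that linear-time generation is memory: the cache itself grows linearly with context length. The back-of-envelope estimate below uses a hypothetical model shape (not one from the paper) to show why long contexts make KV memory a dominant cost.

```python
# Rough KV cache memory estimate for a hypothetical decoder-only model.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_000, batch=1)
print(f"KV cache: {size / 1e9:.1f} GB for a 32k-token context")   # ~16.8 GB
```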
arXiv Detail & Related papers (2024-07-25T12:56:22Z) - Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
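The observation about highly similar key states motivates merging; the greedy pairwise rule below is a simplified illustration of that idea, not KVMerger's actual merge-set construction.

```python
# Hedged sketch of similarity-based KV merging: average consecutive cache
# entries whose key vectors are nearly parallel.
import numpy as np

def merge_similar_kv(K, V, threshold=0.95):
    merged_K, merged_V = [K[0]], [V[0]]
    for k, v in zip(K[1:], V[1:]):
        last = merged_K[-1]
        cos = k @ last / (np.linalg.norm(k) * np.linalg.norm(last) + 1e-8)
        if cos >= threshold:                       # nearly identical key: fold into previous slot
            merged_K[-1] = (merged_K[-1] + k) / 2
            merged_V[-1] = (merged_V[-1] + v) / 2
        else:
            merged_K.append(k)
            merged_V.append(v)
    return np.stack(merged_K), np.stack(merged_V)

rng = np.random.default_rng(5)
base = rng.standard_normal(64)
K = np.stack([base + 0.01 * rng.standard_normal(64) for _ in range(10)])  # highly similar keys
V = rng.standard_normal((10, 64))
K_m, V_m = merge_similar_kv(K, V)
print(f"cache length: {K.shape[0]} -> {K_m.shape[0]}")
```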
arXiv Detail & Related papers (2024-07-11T12:50:42Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
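A toy version of mixed-precision KV quantization is sketched below: each cached token's vectors are quantized uniformly, with more bits spent on tokens flagged as sensitive. The sensitivity flags and bit choices are illustrative assumptions, not QAQ's quality-adaptive criterion.

```python
# Hedged sketch of mixed-precision, per-token KV cache quantization.
import numpy as np

def quantize(x, bits):
    """Uniform per-vector quantization; returns codes plus (scale, zero) for dequant."""
    lo, hi = x.min(), x.max()
    scale = max((hi - lo) / (2**bits - 1), 1e-8)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(6)
K = rng.standard_normal((8, 64)).astype(np.float32)
sensitive = np.array([True, False, False, True, False, False, False, False])
recon = np.empty_like(K)
for i, row in enumerate(K):
    bits = 8 if sensitive[i] else 2            # more bits for sensitive tokens
    recon[i] = dequantize(*quantize(row, bits))
err = np.abs(recon - K).mean(axis=1)
print("mean abs error per token:", np.round(err, 3))
```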
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - LLM Inference Unveiled: Survey and Roofline Model Insights [62.92811060490876]
Large Language Model (LLM) inference is rapidly evolving, presenting a unique blend of opportunities and challenges.
Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model.
This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems.
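The roofline idea itself can be shown in a few lines: a kernel's attainable throughput is capped either by peak compute or by memory bandwidth times its arithmetic intensity. The hardware numbers and operation counts below are illustrative assumptions, not measurements from the survey.

```python
# Hedged sketch of the roofline model for reasoning about LLM inference kernels.
def roofline_tflops(peak_tflops, bandwidth_tb_s, flops, bytes_moved):
    intensity = flops / bytes_moved                     # FLOPs per byte moved
    return min(peak_tflops, bandwidth_tb_s * intensity), intensity

# Decode-time attention is typically memory-bound: it streams the whole KV cache
# to produce one token, so intensity is low and bandwidth is the cap.
attainable, intensity = roofline_tflops(
    peak_tflops=300.0,                   # hypothetical accelerator peak
    bandwidth_tb_s=2.0,                  # hypothetical HBM bandwidth (TB/s)
    flops=2 * 2 * 32_000 * 128,          # toy count: QK^T plus AV over a 32k x 128 cache
    bytes_moved=2 * 32_000 * 128 * 2,    # keys + values read in fp16
)
print(f"intensity = {intensity:.2f} FLOP/byte, attainable ~= {attainable:.1f} TFLOP/s")
```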
arXiv Detail & Related papers (2024-02-26T07:33:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.