Related papers: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

URL: http://arxiv.org/abs/2402.15220v3
Date: Sat, 13 Jul 2024 02:53:06 GMT
Title: ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
Authors: Lu Ye, Ze Tao, Yong Huang, Yang Li
Abstract summary: ChunkAttention is a prefix-aware self-attention module for large language models. It can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the state-of-the-art implementation.
Score: 3.659659889927316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into the auxiliary prefix tree. Consequently, on top of the prefix-tree based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8$\times$ compared to the state-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
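
The chunked prefix-tree idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names ChunkNode, PrefixTreeKVCache, CHUNK_SIZE, and _compute_kv are hypothetical, the key/value tensors are replaced by placeholder tuples, only whole identical chunks are shared, and the two-phase partition kernel is not modeled at all.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical number of tokens per chunk (an assumption, not from the paper)


@dataclass
class ChunkNode:
    tokens: tuple                      # token ids covered by this chunk
    kv: list                           # placeholder standing in for the chunk's key/value tensors
    children: dict = field(default_factory=dict)
    ref_count: int = 0                 # number of inserted sequences passing through this chunk


class PrefixTreeKVCache:
    """Toy prefix tree of KV chunks; sequences with equal prefixes share nodes."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.root = ChunkNode(tokens=(), kv=[])
        self.num_chunks = 0            # chunks actually allocated

    def _compute_kv(self, tokens):
        # Stand-in for the real key/value projection of each token.
        return [("k", t, "v", t) for t in tokens]

    def insert(self, token_ids):
        """Add a sequence; return how many tokens reused existing KV chunks."""
        node = self.root
        reused = 0
        for start in range(0, len(token_ids), self.chunk_size):
            chunk = tuple(token_ids[start:start + self.chunk_size])
            child = node.children.get(chunk)
            if child is None:
                # No matching chunk under this prefix: allocate a new one.
                child = ChunkNode(tokens=chunk, kv=self._compute_kv(chunk))
                node.children[chunk] = child
                self.num_chunks += 1
            else:
                # Same chunk under the same prefix: share it.
                reused += len(chunk)
            child.ref_count += 1
            node = child
        return reused


# Two requests that share an 8-token "system prompt".
cache = PrefixTreeKVCache(chunk_size=4)
system_prompt = list(range(8))
reused_first = cache.insert(system_prompt + [100, 101])
reused_second = cache.insert(system_prompt + [200, 201, 202])
print(reused_first, reused_second, cache.num_chunks)
```

In this toy setup the second request reuses both chunks of the shared prefix and allocates only one new chunk for its own suffix, which is the memory saving the abstract attributes to prefix-aware sharing.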

Related papers

Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency in Deployed Inference [1.0175051111288864]
We introduce a novel integration of PagedAttention with PyTorch's FlexAttention. Our benchmarks on an NVIDIA L4 GPU demonstrate significantly reduced inference latency. We open-source the full implementation and discuss its implications for future long-context model deployment.
arXiv Detail & Related papers (2025-06-08T22:59:20Z)
FlashForge: Ultra-Efficient Prefix-Aware Attention for LLM Decoding [44.47821531299985]
Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix. Decoding is a memory-intensive process requiring heavy memory access on the key-value (KV) cache of the prefixes. We propose a dedicated attention kernel to combine the memory access of shared KV cache in the decode stage, namely FlashForge.
arXiv Detail & Related papers (2025-05-23T10:03:28Z)
Sparse Attention Remapping with Clustering for Efficient LLM Decoding on PIM [7.651654889371008]
Transformer-based models are the foundation of modern machine learning, but their execution places significant pressure on memory systems. Processing-in-memory (PIM) architectures are a promising solution, offering high internal bandwidth and compute parallelism near memory. Current PIM designs are primarily optimized for dense attention and struggle with the dynamic, irregular access patterns introduced by modern KV cache sparsity techniques.
arXiv Detail & Related papers (2025-05-09T04:17:05Z)
MPCache: MPC-Friendly KV Cache Eviction for Efficient Private Large Language Model Inference [5.1206021159434805]
MPCache is built on the observation that historical tokens in a long sequence may have different effects on the downstream decoding. MPCache consistently outperforms prior-art KV cache eviction baselines across different LLM generation tasks.
arXiv Detail & Related papers (2025-01-12T13:18:04Z)
CSR:Achieving 1 Bit Key-Value Cache via Sparse Representation [63.65323577445951]
We propose a novel approach called Cache Sparse Representation (CSR). CSR transforms the dense Key-Value cache tensor into sparse indexes and weights, offering a more memory-efficient representation during LLM inference. Our experiments demonstrate CSR achieves performance comparable to state-of-the-art KV cache quantization algorithms.
arXiv Detail & Related papers (2024-12-16T13:01:53Z)
Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed. We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value. We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [19.510078997414606]
EPIC introduces position-independent context caching for large language models. EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems.
arXiv Detail & Related papers (2024-10-20T08:42:29Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models. We propose an importance-driven cache merging strategy to prune redundancy caches. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference [22.684773338989007]
Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens. Existing inference systems for tree-based applications are inefficient due to improper partitioning of queries and KV cache during attention calculation. We propose DeFT, a hardware-efficient attention algorithm with prefix-aware and load-balanced KV cache partitions.
arXiv Detail & Related papers (2024-03-30T04:34:54Z)
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
Efficient Memory Management for Large Language Model Serving with PagedAttention [44.70922552274376]
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. We propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.
arXiv Detail & Related papers (2023-09-12T12:50:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.