Post-Training Sparse Attention with Double Sparsity
- URL: http://arxiv.org/abs/2408.07092v2
- Date: Sun, 18 Aug 2024 17:27:17 GMT
- Title: Post-Training Sparse Attention with Double Sparsity
- Authors: Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
- Abstract summary: "Double Sparsity" is a novel post-training sparse attention technique that alleviates the bottleneck of excessive Key-Value (KV) cache accesses during LLM inference.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K.
- Score: 44.772593893621085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1$\times$ acceleration in attention operations and a 1.9$\times$ improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at https://github.com/andy-yang-1/DoubleSparse.
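To make the combination of token and channel sparsity concrete, below is a minimal single-head PyTorch sketch of the idea described in the abstract: approximate attention scores are computed from a small set of pre-calibrated "important" channels, the top-k tokens under that approximation are selected, and exact attention runs only over those tokens. The calibration criterion, function names, and shapes are illustrative assumptions, not the released implementation (see the linked repository for that).

```python
import torch

def calibrate_channels(K, r):
    """Offline stand-in for channel calibration: pick the r feature channels
    with the largest average magnitude. The paper's actual calibration
    criterion and calibration data may differ."""
    return K.abs().mean(dim=0).topk(r).indices            # [r]

def double_sparse_attention(q, K, V, channel_idx, k):
    """One decode step for a single head (q: [d], K/V: [n, d]).
    Channel sparsity: score tokens using only `channel_idx` channels.
    Token sparsity: run exact attention over only the top-k tokens."""
    d = q.shape[-1]
    approx_scores = K[:, channel_idx] @ q[channel_idx]    # [n], cheap estimate
    top_idx = approx_scores.topk(k).indices               # [k] important tokens
    K_top, V_top = K[top_idx], V[top_idx]
    scores = (K_top @ q) / d ** 0.5                       # exact scores, [k]
    return torch.softmax(scores, dim=-1) @ V_top          # [d]

# Toy usage at the paper's 1/16 token and 1/16 channel sparsity.
n, d = 4096, 128
K, V, q = torch.randn(n, d), torch.randn(n, d), torch.randn(d)
channel_idx = calibrate_channels(K, d // 16)              # 8 of 128 channels
out = double_sparse_attention(q, K, V, channel_idx, n // 16)
```

Because the channel indices are fixed offline, only the selected channels of the keys need to be read to rank tokens, which is what reduces KV cache traffic; the exact attention then touches just the top-k entries (or fetches them from CPU memory when offloading).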
Related papers
- ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression [29.163757099307553]
We present ZipVL, an efficient inference framework designed for large vision-language models (LVLMs).
ZipVL resolves both computation and memory bottlenecks through a dynamic ratio allocation strategy of important tokens.
Experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6$\times$ and reduce GPU memory usage by 50.0%.
arXiv Detail & Related papers (2024-10-11T07:24:21Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [19.985314022860432]
KV cache stores key and value states from previous tokens to avoid re-computation.
KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance.
We present ZipCache, an accurate and efficient KV cache quantization method for LLMs.
arXiv Detail & Related papers (2024-05-23T07:37:16Z)
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant-sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory (a hedged sketch of the asymmetric 2-bit quantization idea appears after this list).
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
- H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
arXiv Detail & Related papers (2023-06-24T20:11:14Z)
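The H$_2$O entry above describes an eviction policy that keeps a mix of recent tokens and "heavy hitter" tokens. The following is a minimal sketch of that selection rule, assuming accumulated attention mass as the heavy-hitter criterion; function names and budgets are illustrative, not the paper's implementation.

```python
import torch

def h2o_keep_indices(acc_scores, heavy_budget, recent_budget):
    """Return the cache positions to keep: the `recent_budget` newest tokens
    plus the `heavy_budget` older tokens with the largest accumulated
    attention mass ("heavy hitters")."""
    n = acc_scores.shape[0]
    recent = torch.arange(n - recent_budget, n)
    older = acc_scores[: n - recent_budget]
    heavy = older.topk(min(heavy_budget, older.numel())).indices
    return torch.cat([heavy, recent]).sort().values

# Toy usage: accumulate attention probabilities over a few decode steps,
# then keep 64 heavy hitters plus the 64 most recent tokens.
n = 1024
acc_scores = torch.zeros(n)
for _ in range(8):                                   # pretend decode steps
    acc_scores += torch.softmax(torch.randn(n), dim=-1)
keep = h2o_keep_indices(acc_scores, heavy_budget=64, recent_budget=64)
```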
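The KIVI entry above describes tuning-free asymmetric 2-bit KV cache quantization. Below is a hedged sketch of asymmetric (zero-point) 2-bit quantization applied to a KV cache; KIVI reportedly groups keys per-channel and values per-token, but treat that grouping, and all names here, as assumptions rather than a reproduction of its algorithm.

```python
import torch

def quantize_2bit_asymmetric(x, group_dim):
    """Asymmetric 2-bit quantization along `group_dim`: each group gets its
    own scale and zero point so values map into {0, 1, 2, 3}."""
    xmin = x.amin(dim=group_dim, keepdim=True)
    xmax = x.amax(dim=group_dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0       # 2 bits -> 4 levels
    q = torch.round((x - xmin) / scale).clamp(0, 3).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, xmin):
    """Map 2-bit codes back to approximate floating-point values."""
    return q.to(scale.dtype) * scale + xmin

# Toy usage on [n_tokens, d] caches: quantize keys along the token dimension
# (per-channel groups) and values along the channel dimension (per-token
# groups) -- an assumed grouping for illustration only.
K, V = torch.randn(512, 128), torch.randn(512, 128)
Kq, Ks, Kz = quantize_2bit_asymmetric(K, group_dim=0)   # per-channel keys
Vq, Vs, Vz = quantize_2bit_asymmetric(V, group_dim=1)   # per-token values
K_hat, V_hat = dequantize(Kq, Ks, Kz), dequantize(Vq, Vs, Vz)
```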