H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- URL: http://arxiv.org/abs/2306.14048v3
- Date: Mon, 18 Dec 2023 19:10:00 GMT
- Title: H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- Authors: Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
- Abstract summary: We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
- Score: 110.06476624089679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), despite their recent impressive
accomplishments, are notably cost-prohibitive to deploy, particularly for
applications involving long-content generation, such as dialogue systems and
story writing. Often, a large amount of transient state information, referred
to as the KV cache, is stored in GPU memory in addition to model parameters,
scaling linearly with the sequence length and batch size. In this paper, we
introduce a novel approach for implementing the KV cache which significantly
reduces its memory footprint. Our approach is based on the noteworthy
observation that a small portion of tokens contributes most of the value when
computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through
a comprehensive investigation, we find that (i) the emergence of H$_2$ is
natural and strongly correlates with the frequent co-occurrence of tokens in
the text, and (ii) removing them results in significant performance
degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O),
a KV cache eviction policy that dynamically retains a balance of recent and
H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular
problem and prove (under mild assumptions) a theoretical guarantee for our
novel eviction algorithm which could help guide future work. We validate the
accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of
tasks. Our implementation of H$_2$O with 20% heavy hitters improves the
throughput over three leading inference systems (DeepSpeed Zero-Inference,
Hugging Face Accelerate, and FlexGen) by up to 29$\times$, 29$\times$, and
3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H$_2$O can reduce the
latency by up to 1.9$\times$. The code is available at
https://github.com/FMInference/H2O.
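
The linear scaling of the KV cache noted in the abstract is easy to make concrete. Below is a minimal back-of-the-envelope sketch; the layer/head/width numbers are assumptions chosen to roughly match OPT-30B, not values taken from the paper.

```python
# Back-of-the-envelope KV cache size: K and V tensors (2 total) per layer,
# each of shape [batch, n_heads, seq_len, head_dim], stored in fp16 (2 bytes).
def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * batch * n_heads * seq_len * head_dim * bytes_per_elem

# Illustrative config in the ballpark of OPT-30B (assumed):
# 48 layers, 56 heads of dimension 128 (hidden size 7168).
gib = kv_cache_bytes(batch=64, seq_len=1024, n_layers=48,
                     n_heads=56, head_dim=128) / 2**30
print(f"KV cache: {gib:.1f} GiB")  # ~84 GiB; doubles if batch or seq_len doubles
```

At this scale the transient cache dwarfs typical GPU memory budgets, which is why retaining only 20% of the tokens translates directly into larger feasible batch sizes.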
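A minimal sketch of the eviction idea as the abstract describes it: score each cached token by its accumulated attention weight, keep the top-scoring heavy hitters plus a window of recent tokens, and evict the rest. This illustrates the stated policy, not the paper's reference implementation; the function name, the 50/50 budget split, and the usage below are assumptions.

```python
import numpy as np

def h2o_keep_mask(attn_weights, budget, recent_frac=0.5):
    """Choose which KV cache positions to retain.

    attn_weights: [num_queries, seq_len] softmax attention over cached tokens.
    budget:       total number of positions to keep.
    recent_frac:  fraction of the budget reserved for the most recent tokens.
    """
    seq_len = attn_weights.shape[1]
    if seq_len <= budget:                      # nothing to evict yet
        return np.ones(seq_len, dtype=bool)

    n_recent = int(budget * recent_frac)
    n_heavy = budget - n_recent

    keep = np.zeros(seq_len, dtype=bool)
    keep[seq_len - n_recent:] = True           # always keep the recency window

    if n_heavy > 0:
        # Heavy hitters: largest accumulated attention among older positions.
        scores = attn_weights.sum(axis=0)
        scores[seq_len - n_recent:] = -np.inf  # exclude the recency window
        keep[np.argsort(scores)[-n_heavy:]] = True
    return keep

# Toy usage: 4 queries attending over a 100-token cache, 20-token budget.
attn = np.random.rand(4, 100)
attn /= attn.sum(axis=1, keepdims=True)
mask = h2o_keep_mask(attn, budget=20)
assert mask.sum() == 20
```

In an actual decoder this mask would be maintained per attention head and updated online at every decoding step, with accumulated scores carried over rather than recomputed from scratch.
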
Related papers
- HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning [47.822285290729496]
Hierarchically Pruned Attention (HiP) simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T \log T)$.
HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query.
Experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A $3$D registration approach can process more than ten million ($10^7$) point pairs with over $99\%$ random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
arXiv Detail & Related papers (2024-04-01T04:43:39Z) - Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z) - Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z) - KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.