H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large
Language Models
- URL: http://arxiv.org/abs/2306.14048v3
- Date: Mon, 18 Dec 2023 19:10:00 GMT
- Title: H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large
Language Models
- Authors: Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng,
Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett,
Zhangyang Wang, Beidi Chen
- Abstract summary: We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens.
- Score: 110.06476624089679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), despite their recent impressive
accomplishments, are notably cost-prohibitive to deploy, particularly for
applications involving long-content generation, such as dialogue systems and
story writing. Often, a large amount of transient state information, referred
to as the KV cache, is stored in GPU memory in addition to model parameters,
scaling linearly with the sequence length and batch size. In this paper, we
introduce a novel approach for implementing the KV cache which significantly
reduces its memory footprint. Our approach is based on the noteworthy
observation that a small portion of tokens contributes most of the value when
computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through
a comprehensive investigation, we find that (i) the emergence of H$_2$ is
natural and strongly correlates with the frequent co-occurrence of tokens in
the text, and (ii) removing them results in significant performance
degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O),
a KV cache eviction policy that dynamically retains a balance of recent and
H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular
problem and prove (under mild assumptions) a theoretical guarantee for our
novel eviction algorithm which could help guide future work. We validate the
accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of
tasks. Our implementation of H$_2$O with 20% heavy hitters improves the
throughput over three leading inference systems DeepSpeed Zero-Inference,
Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and
3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H$_2$O can reduce the
latency by up to 1.9$\times$. The code is available at
https://github.com/FMInference/H2O.
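For intuition, here is a minimal, hedged sketch of what an H$_2$O-style eviction policy could look like for a single attention head: the cache accumulates each cached token's attention mass across decoding steps and, once the budget is exceeded, keeps the most recent tokens plus the highest-scoring heavy hitters. The `H2OCache` class, its method names, and the assumed shapes are illustrative, not the authors' API; see the linked repository for the actual implementation.

```python
import torch

class H2OCache:
    """Illustrative per-head KV cache that keeps `num_recent` most recent tokens
    plus `num_heavy` heavy hitters ranked by accumulated attention scores."""

    def __init__(self, num_heavy: int, num_recent: int):
        self.num_heavy = num_heavy
        self.num_recent = num_recent
        self.keys = None        # (seq, head_dim)
        self.values = None      # (seq, head_dim)
        self.acc_scores = None  # (seq,) accumulated attention mass per cached token

    def update(self, key, value, attn_weights):
        """key, value: (head_dim,) for the new token;
        attn_weights: (seq,) attention distribution of the new query over all
        cached tokens (including the new one), i.e. this step's softmax output."""
        key, value = key.unsqueeze(0), value.unsqueeze(0)
        if self.keys is None:
            self.keys, self.values = key, value
            self.acc_scores = torch.zeros(1)
        else:
            self.keys = torch.cat([self.keys, key], dim=0)
            self.values = torch.cat([self.values, value], dim=0)
            self.acc_scores = torch.cat([self.acc_scores, torch.zeros(1)])

        # Accumulate this decoding step's attention mass onto every cached token.
        self.acc_scores = self.acc_scores + attn_weights

        budget = self.num_heavy + self.num_recent
        seq_len = self.keys.shape[0]
        if seq_len <= budget:
            return

        # Always keep the most recent tokens; choose heavy hitters among the rest.
        recent_idx = torch.arange(seq_len - self.num_recent, seq_len)
        older_scores = self.acc_scores[: seq_len - self.num_recent]
        heavy_idx = torch.topk(older_scores, self.num_heavy).indices
        keep = torch.cat([heavy_idx.sort().values, recent_idx])

        self.keys = self.keys[keep]
        self.values = self.values[keep]
        self.acc_scores = self.acc_scores[keep]
```

In a real decoder one such cache would be maintained per layer and per head, with `attn_weights` taken from that step's softmax over the cached keys; once a token is evicted it never returns, which matches the greedy flavor of the eviction policy described in the abstract.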
Related papers
- BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference [2.3587921104010756]
We propose BUZZ, a novel KV caching algorithm to minimize cache memory usage while enhancing inference speed.
BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information.
We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA.
arXiv Detail & Related papers (2024-10-30T14:53:37Z)
- Post-Training Sparse Attention with Double Sparsity [44.772593893621085]
"Double Sparsity" is a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K.
arXiv Detail & Related papers (2024-08-11T18:40:36Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z)
- Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A 3D registration approach can process more than ten million ($10^7$) point pairs with over $99\%$ random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
arXiv Detail & Related papers (2024-04-01T04:43:39Z)
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $\mathbf{2.6\times}$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
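As a companion to the Double Sparsity entry above, the following is a minimal sketch of the token-plus-channel sparsity idea it describes: approximate attention scores are computed from a small set of assumed "important" channels, and exact attention is then restricted to the top-scoring tokens. The function name, shapes, and the way `channel_idx` is chosen are illustrative assumptions, not the paper's implementation.

```python
import torch

def double_sparse_attention(q, K, V, channel_idx, top_k):
    """Illustrative single-head attention with token + channel sparsity.

    q: (d,) query; K, V: (seq, d) cached keys/values;
    channel_idx: indices of channels assumed to be important;
    top_k: number of tokens kept for exact attention.
    """
    d = q.shape[-1]
    # Channel sparsity: rank tokens using only a few feature channels.
    approx_scores = K[:, channel_idx] @ q[channel_idx]
    # Token sparsity: keep only the highest-ranked tokens.
    keep = torch.topk(approx_scores, min(top_k, K.shape[0])).indices
    # Exact attention restricted to the selected tokens.
    scores = (K[keep] @ q) / d ** 0.5
    weights = torch.softmax(scores, dim=0)
    return weights @ V[keep]
```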