GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless
Generative Inference of LLM
- URL: http://arxiv.org/abs/2403.05527v2
- Date: Mon, 11 Mar 2024 18:55:40 GMT
- Title: GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless
Generative Inference of LLM
- Authors: Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu,
Tushar Krishna, Tuo Zhao
- Abstract summary: Key-value (KV) caching has become the de-facto technique for accelerating generation in large language model (LLM) inference.
Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly.
We propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression.
- Score: 39.77567916589569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key-value (KV) caching has become the de-facto technique for
accelerating generation in large language model (LLM) inference. However, the
growing cache demand with increasing sequence length has turned LLM inference
into a memory-bound problem, significantly constraining system throughput.
Existing methods rely on dropping unimportant tokens or quantizing all entries
uniformly. Such methods, however, often incur high approximation error when
representing the compressed matrices. The autoregressive decoding process
further compounds the error at each step, resulting in critical deviation in
model generation and deterioration of performance. To tackle this challenge,
we propose GEAR, an efficient KV cache compression framework that achieves
near-lossless high-ratio compression. GEAR first quantizes the majority of
entries, which have similar magnitudes, to ultra-low precision. It then
employs a low-rank matrix to approximate the quantization error and a sparse
matrix to remedy the individual errors from outlier entries. By adeptly
integrating the three techniques, GEAR fully exploits their synergistic
potential. Our experiments demonstrate that, compared to alternatives, GEAR
achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput
improvement while reducing peak memory size by up to 2.29x. Our code is publicly
available at https://github.com/HaoKang-Timmy/GEAR.
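The decomposition described in the abstract (ultra-low-precision quantization of the bulk of entries, a low-rank approximation of the quantization error, and a sparse matrix for outliers) can be illustrated with a short sketch. The function names, hyperparameters (bits, outlier ratio, rank), and the plain SVD used for the low-rank term are illustrative assumptions, not the paper's implementation; the official code is in the linked repository.

```python
# Hedged sketch of a GEAR-style decomposition: X ~= dequant(Q) + L @ R + S.
# Hyperparameters and the SVD-based low-rank fit are illustrative assumptions.
import torch

def gear_like_compress(X: torch.Tensor, bits: int = 4,
                       outlier_ratio: float = 0.01, rank: int = 4):
    # 1) Move the largest-magnitude entries into a sparse outlier matrix S.
    k = max(1, int(outlier_ratio * X.numel()))
    thresh = X.abs().flatten().kthvalue(X.numel() - k + 1).values
    outlier_mask = X.abs() >= thresh
    S = torch.where(outlier_mask, X, torch.zeros_like(X)).to_sparse()
    D = torch.where(outlier_mask, torch.zeros_like(X), X)  # remaining entries of similar magnitude

    # 2) Uniformly quantize the dense remainder to `bits` bits.
    qmax = 2 ** bits - 1
    lo, hi = D.min(), D.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    Q = ((D - lo) / scale).round().clamp(0, qmax)  # low-bit integer codes
    D_hat = Q * scale + lo                         # dequantized approximation

    # 3) Approximate the quantization error with a rank-`rank` factorization.
    U, sigma, Vh = torch.linalg.svd(D - D_hat, full_matrices=False)
    L = U[:, :rank] * sigma[:rank]   # (n, rank)
    R = Vh[:rank, :]                 # (rank, d)

    return Q.to(torch.uint8), scale, lo, L, R, S

def gear_like_decompress(Q, scale, lo, L, R, S):
    # Recombine the three components into a dense approximation of X.
    return Q.float() * scale + lo + L @ R + S.to_dense()
```

Decompression is simply the sum of the dequantized low-bit matrix, the rank-r product, and the sparse outliers, so the remaining approximation error is whatever the low-rank and sparse terms fail to capture.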
Related papers
- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [19.985314022860432]
KV cache stores key and value states from previous tokens to avoid re-computation.
KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance.
We present ZipCache, an accurate and efficient KV cache quantization method for LLMs.
arXiv Detail & Related papers (2024-05-23T07:37:16Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by retaining crucial context layer by layer.
PyramidInfer improves throughput by 2.2x compared to Accelerate, with over 54% GPU memory reduction in the KV cache.
arXiv Detail & Related papers (2024-05-21T06:46:37Z)
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference [1.9639467358416092]
Transformers have emerged as the backbone of large language models (LLMs).
We propose Dynamic Memory Compression (DMC), a method for online key-value cache compression at inference time.
We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to 7x throughput increase during auto-regressive inference on an NVIDIA H100 GPU.
arXiv Detail & Related papers (2024-03-14T17:59:26Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6x less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
- LoMA: Lossless Compressed Memory Attention [0.0]
Lossless Compressed Memory Attention (LoMA) is a novel approach to reduce memory and computational demands during autoregressive generation.
LoMA incorporates a specialized training or fine-tuning procedure alongside an autoregressive generation algorithm optimized for the compressed context.
Experimental validation has demonstrated that LoMA significantly reduces computational consumption and memory usage.
arXiv Detail & Related papers (2024-01-16T09:18:46Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
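The SqueezeLLM summary above names two ideas: sensitivity-based non-uniform quantization and a Dense-and-Sparse decomposition of the weights. A minimal sketch of both follows, under the assumption that the second-order sensitivity is available as a per-weight score (here a placeholder argument) and that the non-uniform codebook is fit with a sensitivity-weighted 1-D k-means; names and hyperparameters are illustrative, not the paper's implementation.

```python
# Hedged sketch: split weights into exact sparse outliers plus a dense part
# quantized to a non-uniform codebook fit by sensitivity-weighted 1-D k-means.
import torch

def dense_and_sparse_quantize(W: torch.Tensor, sensitivity: torch.Tensor,
                              bits: int = 3, outlier_ratio: float = 0.005,
                              iters: int = 10):
    # Dense-and-Sparse: keep the largest-magnitude entries exact and sparse.
    k = max(1, int(outlier_ratio * W.numel()))
    thresh = W.abs().flatten().kthvalue(W.numel() - k + 1).values
    mask = W.abs() >= thresh
    S = torch.where(mask, W, torch.zeros_like(W)).to_sparse()
    dense = W[~mask]           # flattened non-outlier weights
    weight = sensitivity[~mask]

    # Non-uniform quantization: sensitivity-weighted 1-D k-means centroids.
    n_centroids = 2 ** bits
    centroids = torch.quantile(dense, torch.linspace(0, 1, n_centroids))
    for _ in range(iters):
        assign = (dense[:, None] - centroids[None, :]).abs().argmin(dim=1)
        for c in range(n_centroids):
            sel = assign == c
            if sel.any():
                centroids[c] = (dense[sel] * weight[sel]).sum() / (weight[sel].sum() + 1e-12)

    # Store low-bit codes; reconstruct the dense part as centroids[codes].
    codes = (dense[:, None] - centroids[None, :]).abs().argmin(dim=1)
    return codes.to(torch.uint8), centroids, mask, S
```

Reconstruction scatters centroids[codes] back into the non-outlier positions (tracked by mask) and adds S, so only the small codebook, the low-bit codes, and the sparse outliers need to be stored.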