Related papers: GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

URL: http://arxiv.org/abs/2403.05527v4
Date: Mon, 30 Sep 2024 22:44:58 GMT
Title: GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Authors: Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao,
Abstract summary: Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. We propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression.
Score: 37.87634266742105
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Key-value (KV) caching has become the de-facto to accelerate generation speed for large language models (LLMs) inference. However, the growing cache demand with increasing sequence length has transformed LLM inference to be a memory bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors to represent the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first applies quantization to majority of entries of similar magnitudes to ultra-low precision. It then employs a low rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating three techniques, GEAR is able to fully exploit their synergistic potentials. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak-memory size up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.

Related papers

When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models [12.687035979970194]
This paper introduces a framework to compress large language models (LLMs) after quantization. A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method to improve further. Experiments show inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
arXiv Detail & Related papers (2025-02-21T13:11:22Z)
Compression for Better: A General and Stable Lossless Compression Framework [7.356622397575378]
Key challenge is effectively leveraging compression errors to minimize model loss. We propose a general textbfLosstextbfLess textbfCompression theoretical framework (textbfLLC) We apply various compression techniques, including quantization and decomposition.
arXiv Detail & Related papers (2024-12-09T09:55:54Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Palu: Compressing KV-Cache with Low-Rank Projection [7.2863629986391025]
This paper presents a KV-Cache compression framework called Palu. Palu decomposes the linear layers into low-rank matrices, caches compressed intermediate states, and reconstructs the full keys and values on the fly. Experiments show that Palu compresses KV-Cache by 50% while maintaining strong accuracy and delivering up to 1.89x on the RoPE-based attention module.
arXiv Detail & Related papers (2024-07-30T18:19:38Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification [19.985314022860432]
KV cache stores key and value states from previous tokens to avoid re-computation. KV cache compression seeks to discern the saliency of tokens, preserving vital information while aggressively compressing those of less importance. We present ZipCache, an accurate and efficient KV cache quantization method for LLMs.
arXiv Detail & Related papers (2024-05-23T07:37:16Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference. We propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context. PyramidInfer improves 2.2x throughput compared to Accelerate with over 54% GPU memory reduction in KV cache.
arXiv Detail & Related papers (2024-05-21T06:46:37Z)
Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value ( KV) cache. Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs. We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI. KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.