Related papers: KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

URL: http://arxiv.org/abs/2401.18079v5
Date: Fri, 25 Oct 2024 18:29:43 GMT
Title: KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Authors: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami,
Abstract summary: LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
Score: 67.74400574357472
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; and (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges. By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve < 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ~1.7x speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model.

Related papers

KurTail : Kurtosis-based LLM Quantization [51.24081396305435]
KurTail is a new post-training quantization scheme that mitigates outliers in the activations of large language models. It offers a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and reduces perplexity by 2.9%, all while reducing the training cost.
arXiv Detail & Related papers (2025-03-03T12:43:06Z)
CalibQuant: 1-Bit KV Cache Quantization for Multimodal LLMs [45.77132019859689]
CalibQuant is a visual quantization strategy that drastically reduces both memory and computational overhead. We achieve a 10x throughput increase on InternVL models.
arXiv Detail & Related papers (2025-02-15T05:08:01Z)
KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference [40.97781175723418]
KV cache quantization can improve Large Language Models inference throughput and latency in long contexts. Current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. We propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache.
arXiv Detail & Related papers (2025-02-06T15:26:26Z)
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Large language models often require substantial storage space. Due to their massive parameter count, these models often require substantial storage space. One research direction proposes to compress the models using integer replacements for floating-point numbers.
arXiv Detail & Related papers (2024-10-17T04:35:57Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [10.067037913589175]
Serving LLMs requires substantial memory due to the storage requirements of Key-Value embeddings in the KV cache. Traditional quantization methods face significant memory overhead due to the need to store quantization constants. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss transform followed by sign-bit quantization.
arXiv Detail & Related papers (2024-06-05T17:42:05Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of auto-regressive text generation process. This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI. KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using $mathbf2.6times$ less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.