Accurate KV Cache Quantization with Outlier Tokens Tracing
- URL: http://arxiv.org/abs/2505.10938v1
- Date: Fri, 16 May 2025 07:23:12 GMT
- Title: Accurate KV Cache Quantization with Outlier Tokens Tracing
- Authors: Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang
- Abstract summary: KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive capabilities of Large Language Models (LLMs) come at the cost of substantial computational resources during deployment. While KV Cache can significantly reduce recomputation during inference, it also introduces additional memory overhead. KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Consequently, the common practice is to apply channel-wise quantization to the Keys and token-wise quantization to the Values. However, our further investigation reveals that a small subset of unusual tokens exhibit unique characteristics that deviate from this pattern, which can substantially impact quantization accuracy. To address this, we develop a simple yet effective method to identify these tokens accurately during the decoding process and exclude them from quantization as outlier tokens, significantly improving overall accuracy. Extensive experiments show that our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
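The quantization layout described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it uses plain Python lists, a simple asymmetric min/max quantizer, and hypothetical helper names (`quant_dequant`, `per_channel`, `per_token`, `per_channel_skip_tokens`). The abstract does not say how outlier tokens are detected, so the outlier indices are supplied by hand here as an assumption.

```python
from typing import List, Set

Matrix = List[List[float]]  # rows are tokens, columns are channels


def quant_dequant(group: List[float], bits: int = 2) -> List[float]:
    """Quantize one group to 2**bits levels (min/max), then dequantize."""
    lo, hi = min(group), max(group)
    if hi == lo:
        return list(group)  # constant group, nothing is lost
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in group]


def per_token(x: Matrix, bits: int = 2) -> Matrix:
    """Token-wise quantization (one group per row), as used for Values."""
    return [quant_dequant(row, bits) for row in x]


def per_channel(x: Matrix, bits: int = 2) -> Matrix:
    """Channel-wise quantization (one group per column), as used for Keys."""
    cols = [quant_dequant(list(col), bits) for col in zip(*x)]
    return [list(row) for row in zip(*cols)]


def per_channel_skip_tokens(x: Matrix, outliers: Set[int], bits: int = 2) -> Matrix:
    """Channel-wise quantization that keeps the listed tokens in full precision.

    Assumption: `outliers` is given by the caller; the detection rule from
    the paper is not reproduced here.
    """
    out = [list(row) for row in x]
    kept = [i for i in range(len(x)) if i not in outliers]
    if not kept:
        return out
    quantized = per_channel([x[i] for i in kept], bits)
    for i, row in zip(kept, quantized):
        out[i] = row
    return out


def mean_abs_error(a: Matrix, b: Matrix) -> float:
    """Average absolute difference over all entries of two equal-shape matrices."""
    diffs = [abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy Keys: each channel has its own typical magnitude, except token 3,
# which breaks the per-channel pattern (made-up numbers for illustration).
keys: Matrix = [
    [0.10, 5.0, -2.00],
    [0.20, 5.2, -2.10],
    [0.00, 4.9, -1.90],
    [9.00, -8.0, 7.00],
    [0.15, 5.1, -2.05],
]

key_all = per_channel(keys)                    # outlier stretches every channel's range
key_skip = per_channel_skip_tokens(keys, {3})  # outlier excluded from quantization

err_all = mean_abs_error(keys, key_all)
err_skip = mean_abs_error(keys, key_skip)
print(f"2-bit error, all tokens quantized: {err_all:.4f}")
print(f"2-bit error, outlier token skipped: {err_skip:.4f}")
```

In this toy case the single unusual token widens the min/max range of every channel, so the remaining tokens collapse onto one or two quantization levels. Excluding it restores a tight per-channel range, which is the effect the abstract attributes to outlier tokens.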
Related papers
- NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics [6.048883141729117]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands.
arXiv Detail & Related papers (2025-05-22T04:23:19Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information even 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- SQuat: Subspace-orthogonal KV Cache Quantization [19.131705063324883]
We introduce SQuat (Subspace-orthogonal KV cache quantization). We show that our method reduces peak memory by 2.17x to 2.82x, improves throughput by 2.45x to 3.60x, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
arXiv Detail & Related papers (2025-03-31T17:37:32Z)
- SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention [0.0]
Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the K cache.
arXiv Detail & Related papers (2025-02-21T08:55:21Z)
- More for Keys, Less for Values: Adaptive KV Cache Quantization [59.708443710731146]
This paper introduces an information-aware quantization framework that adaptively compresses the key-value cache in large language models. We show that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. We propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bitwidth for keys and fewer for values.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization.
We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)