Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values
- URL: http://arxiv.org/abs/2502.15075v2
- Date: Fri, 23 May 2025 04:58:47 GMT
- Title: Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values
- Authors: Mohsen Hariri, Alan Luo, Mohammadreza Nemati, Lam Nguyen, Shaochen Zhong, Qifan Wang, Xia Hu, Xiaotian Han, Vipin Chaudhary,
- Abstract summary: We provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information than the value weight matrices. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance.
- Score: 57.54443445583921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have introduced significant advancements to the capabilities of Natural Language Processing (NLP) in recent years. However, as these models continue to scale in size, memory constraints pose a substantial challenge. Key and Value cache (KV cache) quantization has been well-documented as a promising solution to this limitation. In this work, we provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information compared to the value weight matrices, as evidenced by higher spectral and Frobenius norms across most of the layers. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values induces significant improvements to the overall quantization performance. In particular, assigning greater precision to the keys compared to the values achieves a higher degree of precision reduction with minimal impact on model accuracy. We validate these theorems through theory and extensive experiments on several state-of-the-art LLM architectures and benchmarks. These findings offer valuable guidelines for improving KV cache quantization strategies, facilitating more efficient memory utilization without compromising model performance across diverse NLP tasks. Source code is available at https://github.com/mohsenhariri/spectral-kv.
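The Key-Driven Quantization claim lends itself to a quick numerical sanity check. The sketch below is hypothetical code, not the paper's implementation: it uses synthetic Gaussian caches, per-tensor round-to-nearest quantization, and single-head softmax attention to compare a key-heavy 4+2-bit split against a value-heavy 2+4-bit split under the same total bit budget.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric per-tensor quantization (round-to-nearest)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale) * scale

def attention(Q, K, V):
    """Single-head softmax attention."""
    scores = Q @ K.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def attn_error(Q, K, V, k_bits, v_bits):
    """Relative attention-output error when K and V get different bit widths."""
    ref = attention(Q, K, V)
    out = attention(Q, quantize(K, k_bits), quantize(V, v_bits))
    return np.linalg.norm(out - ref) / np.linalg.norm(ref)

rng = np.random.default_rng(0)
T, d = 128, 64
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Same 6-bit total budget, split key-heavy vs. value-heavy.
key_heavy = attn_error(Q, K, V, k_bits=4, v_bits=2)
value_heavy = attn_error(Q, K, V, k_bits=2, v_bits=4)
print(f"key-heavy 4+2 error: {key_heavy:.4f}  value-heavy 2+4 error: {value_heavy:.4f}")
```

Note that key quantization error enters through the softmax while value error enters linearly, which is the mechanism the paper's theorems formalize; on real trained caches, where key norms dominate, the key-heavy split is the one the paper argues for.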
Related papers
- Draft-based Approximate Inference for LLMs [7.287280338330983]
We propose a novel framework for approximate Large Language Model inference. We introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. Our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput.
arXiv Detail & Related papers (2025-06-10T02:37:46Z) - Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information that can reach 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - SQuat: Subspace-orthogonal KV Cache Quantization [19.131705063324883]
We introduce SQuat (Subspace-orthogonal KV cache quantization), which reduces peak memory by 2.17x to 2.82x, improves throughput by 2.45x to 3.60x, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
arXiv Detail & Related papers (2025-03-31T17:37:32Z) - RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
arXiv Detail & Related papers (2025-03-03T18:46:33Z) - SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention [0.0]
Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the K cache.
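The SVD side of this approach is easy to illustrate. The toy sketch below assumes a key cache with low effective rank; the `svd_compress` helper is hypothetical, and the per-singular-channel mixed-precision quantization that SVDq layers on top (which is what pushes the effective width toward 1.25 bits) is omitted here.

```python
import numpy as np

def svd_compress(K, rank):
    """Keep only the top `rank` singular directions of the key cache (sketch)."""
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
T, d, r_true = 256, 64, 8
# Synthetic key cache with low effective rank (fast-decaying spectrum).
K = rng.standard_normal((T, r_true)) @ rng.standard_normal((r_true, d))

err_r8 = np.linalg.norm(svd_compress(K, 8) - K) / np.linalg.norm(K)
err_r4 = np.linalg.norm(svd_compress(K, 4) - K) / np.linalg.norm(K)
print(f"rank-8 rel. error: {err_r8:.2e}  rank-4 rel. error: {err_r4:.2e}")
```

When the spectrum decays fast, most of the cache's energy lives in a few singular channels, so those can be stored at high precision and the rest at very low precision.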
arXiv Detail & Related papers (2025-02-21T08:55:21Z) - Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries. These massive values play a critical role in interpreting contextual knowledge. We trace the emergence of massive values and find that such concentration is caused by Rotary Position Embedding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.
The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.
In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - Memory-Efficient 4-bit Preconditioned Stochastic Optimization [53.422307389223626]
We introduce 4-bit quantization for Shampoo's preconditioners.
To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners.
We demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance.
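The error-feedback idea mentioned above is generic and can be illustrated outside the Shampoo/Cholesky setting. A minimal sketch, assuming plain 4-bit round-to-nearest quantization of a running sum (not the paper's preconditioner setup): each step's rounding residual is carried into the next step, so errors cannot silently accumulate.

```python
import numpy as np

def quantize4(x):
    """Uniform symmetric 4-bit round-to-nearest quantization."""
    scale = np.abs(x).max() / 7 + 1e-12
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
steps = [rng.standard_normal(100) for _ in range(50)]
true_sum = sum(steps)

# Plain quantization: each step's rounding error is lost for good.
plain_sum = sum(quantize4(g) for g in steps)

# Error feedback: carry the rounding residual into the next step.
carry = np.zeros(100)
fed_sum = np.zeros(100)
for g in steps:
    q = quantize4(g + carry)     # quantize the step plus the carried residual
    carry = g + carry - q        # whatever was rounded away is carried forward
    fed_sum += q

print(f"plain error: {np.linalg.norm(plain_sum - true_sum):.3f}  "
      f"with feedback: {np.linalg.norm(fed_sum - true_sum):.3f}")
```

By the telescoping of the carry term, the fed-back sum deviates from the true sum by at most one step's rounding residual, while the plain sum accumulates fifty of them.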
arXiv Detail & Related papers (2024-12-14T03:32:54Z) - Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding.
In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs). We learn the codebook using an exponential moving average; there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup.
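A minimal residual-VQ sketch (hypothetical code; random-sample codebook initialization stands in for the paper's EMA-trained codebooks): each stage quantizes the residual left over by the previous stages, so reconstruction error shrinks as stages are added.

```python
import numpy as np

def nearest(C, X):
    """Index of the nearest codeword in C for each row of X."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def rvq_reconstruct(X, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = X.copy()
    recon = np.zeros_like(X)
    for C in codebooks:
        picked = C[nearest(C, residual)]
        recon += picked
        residual -= picked
    return recon

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 8))

# Build each stage's codebook from samples of the residual that stage will
# actually see -- a crude stand-in for EMA codebook learning.
codebooks, residual = [], X.copy()
for _ in range(3):
    C = residual[rng.choice(len(residual), size=64, replace=False)].copy()
    codebooks.append(C)
    residual = residual - C[nearest(C, residual)]

err1 = np.linalg.norm(X - rvq_reconstruct(X, codebooks[:1]))
err3 = np.linalg.norm(X - rvq_reconstruct(X, codebooks))
print(f"1-stage error: {err1:.2f}  3-stage error: {err3:.2f}")
```

Three stages of a 64-entry codebook cost 3 x 6 = 18 bits per 8-dim vector here, yet approach the fidelity of a single codebook with 2^18 entries, which is why RVQ is attractive for high-fidelity compression.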
arXiv Detail & Related papers (2024-10-21T07:20:41Z) - AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations [36.63586957377984]
Due to their massive parameter count, large language models often require substantial storage space.
One research direction proposes to compress the models using integer replacements for floating-point numbers.
arXiv Detail & Related papers (2024-10-17T04:35:57Z) - AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization [5.572159724234467]
Mixed-precision quantization distinguishes between important and unimportant parameters.
Existing approaches can only identify important parameters through qualitative analysis and manual experiments.
We propose a new criterion, so-called 'precision alignment', to build a quantitative framework to holistically evaluate the importance of parameters.
arXiv Detail & Related papers (2024-09-25T01:39:02Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the memory and compute demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More [55.0856305773081]
Large Language Models (LLMs) face significant deployment challenges due to their substantial memory requirements and the computational demands of the auto-regressive text generation process.
This paper addresses these challenges by focusing on the quantization of LLMs, a technique that reduces memory consumption by converting model parameters and activations into low-bit integers.
arXiv Detail & Related papers (2024-02-19T11:33:21Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use in applications that require large context windows, and with these large context windows, KV cache activations surface as the dominant contributor to memory consumption during inference.
Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
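One well-known obstacle in the sub-4-bit regime is that a few key channels are far larger than the rest, which wrecks a single shared quantization scale; per-channel key scaling is one of the ingredients KVQuant builds on. A toy sketch (synthetic outlier channels, 3-bit round-to-nearest; illustrative only, not KVQuant's actual recipe):

```python
import numpy as np

def quant(X, bits, axis=None):
    """Round-to-nearest uniform quantization, per-tensor or per-channel scales."""
    qmax = 2 ** (bits - 1) - 1
    if axis is None:
        scale = np.abs(X).max() / qmax                          # one shared scale
    else:
        scale = np.abs(X).max(axis=axis, keepdims=True) / qmax  # one per channel
    return np.round(X / scale) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 64))
K[:, :4] *= 50.0  # a handful of outsized channels, as often seen in key activations

def mean_channel_err(Kq):
    """Relative error per channel, averaged, so small channels are not ignored."""
    return (np.linalg.norm(Kq - K, axis=0) / np.linalg.norm(K, axis=0)).mean()

per_tensor = mean_channel_err(quant(K, bits=3))
per_channel = mean_channel_err(quant(K, bits=3, axis=0))
print(f"3-bit per-tensor: {per_tensor:.2f}  per-channel: {per_channel:.2f}")
```

With a shared scale, the outlier channels dictate the step size and the ordinary channels collapse to zero; per-channel scales keep every channel's relative error roughly uniform.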
arXiv Detail & Related papers (2024-01-31T18:58:14Z) - Norm Tweaking: High-performance Low-bit Quantization of Large Language Models [21.855106896725598]
We introduce a technique called norm tweaking, which can be used as a plugin in current PTQ methods to achieve high precision.
Our method demonstrates significant improvements in both weight-only quantization and joint quantization of weights and activations.
Our simple and effective approach makes it more practical for real-world applications.
arXiv Detail & Related papers (2023-09-06T06:51:15Z) - Towards Accurate Post-Training Quantization for Vision Transformer [48.779346466374406]
Existing post-training quantization methods still cause severe performance drops.
APQ-ViT surpasses the existing post-training quantization methods by convincing margins.
arXiv Detail & Related papers (2023-03-25T03:05:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.