PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
- URL: http://arxiv.org/abs/2502.00527v1
- Date: Sat, 01 Feb 2025 18:59:03 GMT
- Title: PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration
- Authors: Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan
- Abstract summary: Quantizing the KV cache to lower bit widths is an effective way to reduce computational costs. Previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge.
- Score: 26.972039704548184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge of outliers in per-channel quantization, making these dimensions well-suited for quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding them as the corresponding quantized radius and polar angle, rather than quantizing the original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
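The encoding described in the abstract can be illustrated with a minimal sketch. This is one reading of the abstract, not the authors' implementation: the pairing of dimensions (2i, 2i+1), the uniform quantizers, the per-pair radius range, and the function names `to_polar`, `quantize_keys`, `build_tables`, and `lookup_scores` are all assumptions made for illustration.

```python
import math


def to_polar(key):
    """Split an even-length key vector into 2-D sub-vectors as (radius, angle).

    Assumption: dimensions (2i, 2i+1) form one pair; real RoPE layouts may
    pair dimensions differently.
    """
    pairs = []
    for i in range(0, len(key), 2):
        x, y = key[i], key[i + 1]
        radius = math.hypot(x, y)
        angle = math.atan2(y, x) % (2 * math.pi)  # map to [0, 2*pi]
        pairs.append((radius, angle))
    return pairs


def quantize_keys(keys, r_bits=4, a_bits=4):
    """Quantize radii (uniform, per-pair min/max) and angles (uniform bins)."""
    polar = [to_polar(k) for k in keys]  # indexed as [token][pair]
    r_levels, a_levels = 2 ** r_bits, 2 ** a_bits
    a_step = 2 * math.pi / a_levels

    # One (minimum, step) entry per dimension pair, shared across tokens.
    stats = []
    for p in range(len(polar[0])):
        radii = [tok[p][0] for tok in polar]
        r_min, r_max = min(radii), max(radii)
        scale = (r_max - r_min) / (r_levels - 1) or 1.0  # avoid a zero step
        stats.append((r_min, scale))

    codes = []
    for tok in polar:
        row = []
        for p, (radius, angle) in enumerate(tok):
            r_min, scale = stats[p]
            r_code = round((radius - r_min) / scale)
            a_code = int(angle / a_step) % a_levels  # wrap the 2*pi edge case
            row.append((r_code, a_code))
        codes.append(row)
    return codes, stats


def build_tables(query, a_bits=4):
    """For each pair, precompute the query's projection onto every angle bin."""
    a_levels = 2 ** a_bits
    a_step = 2 * math.pi / a_levels
    tables = []
    for p in range(len(query) // 2):
        qx, qy = query[2 * p], query[2 * p + 1]
        tables.append([
            qx * math.cos((c + 0.5) * a_step) + qy * math.sin((c + 0.5) * a_step)
            for c in range(a_levels)  # bin centers are the reconstruction points
        ])
    return tables


def lookup_scores(tables, codes, stats):
    """Approximate query-key inner products using only codes and the tables."""
    scores = []
    for row in codes:
        total = 0.0
        for p, (r_code, a_code) in enumerate(row):
            r_min, scale = stats[p]
            total += (r_min + r_code * scale) * tables[p][a_code]
        scores.append(total)
    return scores
```

In this sketch the tables are built once per query and reused for every cached token, so each token's score needs one table read and one multiply per dimension pair rather than a full dequantization of the key.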
Related papers
- PCDVQ: Enhancing Vector Quantization for Large Language Models via Polar Coordinate Decoupling [53.91873442457923]
Vector Quantization (VQ) serves as a prevalent solution to this issue for its extremely low-bit (even at 2-bit) and considerable accuracy. This paper proposes Polar Coordinate Decoupled Vector Quantization (PCDVQ), an effective and efficient VQ framework. Experimental results show that PCDVQ outperforms baseline methods at 2-bit level by at least 1.5% zero-shot accuracy.
arXiv Detail & Related papers (2025-06-05T08:58:58Z)
- TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate [13.14434628836727]
Vector quantization aims to quantize high-dimensional Euclidean vectors while minimizing distortion in their geometric structure.
We propose TurboQuant to address both mean-squared error (MSE) and inner product distortion.
Our data-oblivious algorithms, suitable for online applications, achieve near-optimal distortion rates.
arXiv Detail & Related papers (2025-04-28T15:05:35Z)
- More for Keys, Less for Values: Adaptive KV Cache Quantization [59.708443710731146]
This paper introduces an information-aware quantization framework that adaptively compresses the key-value cache in large language models.
We show that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices.
We propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bitwidth for keys and fewer for values.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
- PolarQuant: Quantizing KV Caches with Polar Transformation [46.38603611763045]
Large language models (LLMs) require significant memory to store Key-Value embeddings in their KV cache.
Quantization of these KV embeddings is a common technique to reduce memory consumption.
This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation.
arXiv Detail & Related papers (2025-02-04T08:52:13Z)
- Memory-Efficient 4-bit Preconditioned Stochastic Optimization [53.422307389223626]
We introduce 4-bit quantization for Shampoo's preconditioners. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners.
arXiv Detail & Related papers (2024-12-14T03:32:54Z)
- Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding.
In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs).
We learn the codebook using an exponential moving average, and there are no other learnable parameters, including the input and output projections normally used in a vector quantization setup.
arXiv Detail & Related papers (2024-10-21T07:20:41Z)
- FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z)
- DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs [40.48697728884967]
Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations.
Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.
We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers.
arXiv Detail & Related papers (2024-06-03T18:27:44Z)
- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Efficient Quantum Circuits for Non-Unitary and Unitary Diagonal Operators with Space-Time-Accuracy trade-offs [1.0749601922718608]
Unitary and non-unitary diagonal operators are fundamental building blocks in quantum algorithms. We introduce a general approach to implement unitary and non-unitary diagonal operators with efficient-adjustable-depth quantum circuits.
arXiv Detail & Related papers (2024-04-03T15:42:25Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Automatic and effective discovery of quantum kernels [41.61572387137452]
Quantum computing can empower machine learning models by enabling kernel machines to leverage quantum kernels for representing similarity measures between data. We present an approach to this problem, which employs optimization techniques, similar to those used in neural architecture search and AutoML. The results obtained by testing our approach on a high-energy physics problem demonstrate that, in the best-case scenario, we can either match or improve testing accuracy with respect to the manual design approach.
arXiv Detail & Related papers (2022-09-22T16:42:14Z)
- Quantum algorithms for grid-based variational time evolution [36.136619420474766]
We propose a variational quantum algorithm for performing quantum dynamics in first quantization.
Our simulations exhibit the previously observed numerical instabilities of variational time propagation approaches.
arXiv Detail & Related papers (2022-03-04T19:00:45Z)
- Efficient multi-qubit subspace rotations via topological quantum walks [1.0486921990935787]
The rotation of subspaces by a chosen angle is a fundamental quantum computing operation.
We propose a fast, high-fidelity way to implement such operations via topological quantum walks.
This procedure can be implemented in superconducting qubits, ion-traps and Rydberg atoms with star-type connectivity.
arXiv Detail & Related papers (2021-11-12T02:10:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.