PolarQuant: Quantizing KV Caches with Polar Transformation
        - URL: http://arxiv.org/abs/2502.02617v1
 - Date: Tue, 04 Feb 2025 08:52:13 GMT
 - Title: PolarQuant: Quantizing KV Caches with Polar Transformation
 - Authors: Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, Amir Zandieh
 - Abstract summary: Large language models (LLMs) require significant memory to store Key-Value embeddings in their KV cache. Quantization of these KV embeddings is a common technique to reduce memory consumption. This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation.
 - Score: 46.38603611763045
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract: Large language models (LLMs) require significant memory to store Key-Value (KV) embeddings in their KV cache, especially when handling long-range contexts. Quantization of these KV embeddings is a common technique to reduce memory consumption. This work introduces PolarQuant, a novel quantization method employing random preconditioning and polar transformation. Our method transforms the KV embeddings into polar coordinates using an efficient recursive algorithm and then quantizes the resulting angles. Our key insight is that, after random preconditioning, the angles in the polar representation exhibit a tightly bounded and highly concentrated distribution with an analytically computable form. This well-behaved distribution eliminates the need for explicit normalization, a step required by traditional quantization methods that introduces significant memory overhead because quantization parameters (e.g., zero point and scale) must be stored in full precision for each data block. PolarQuant bypasses this normalization step, enabling substantial memory savings. Long-context evaluations demonstrate that PolarQuant compresses the KV cache by more than 4.2x while achieving the best quality scores compared to state-of-the-art methods.
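
The construction described in the abstract can be illustrated with a short, hypothetical example. The snippet below is a minimal sketch, not the paper's implementation: it preconditions a vector with a random orthogonal rotation, applies one level of a pairwise polar transform (the paper uses a recursive algorithm), and uniformly quantizes the angles over a fixed range with no per-block zero point or scale; the radii are kept in full precision here purely for clarity. All function names and parameters are illustrative.

```python
import numpy as np

def polar_quantize_sketch(x, bits=4, seed=0):
    """Illustrative sketch: random preconditioning + pairwise polar transform
    + uniform angle quantization. Not the paper's exact algorithm."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    assert d % 2 == 0, "sketch assumes an even embedding dimension"

    # Random preconditioning: a random orthogonal rotation (QR of a Gaussian).
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    y = q @ x

    # Pairwise polar transform: each (y[2i], y[2i+1]) -> (radius, angle).
    pairs = y.reshape(-1, 2)
    radii = np.linalg.norm(pairs, axis=1)
    angles = np.arctan2(pairs[:, 1], pairs[:, 0])          # in (-pi, pi]

    # Uniform angle quantization with a fixed global range: no per-block
    # zero point or scale has to be stored alongside the codes.
    levels = 2 ** bits
    codes = np.clip(np.round((angles + np.pi) / (2 * np.pi) * (levels - 1)),
                    0, levels - 1).astype(np.uint8)
    return codes, radii, q

def polar_dequantize_sketch(codes, radii, q, bits=4):
    levels = 2 ** bits
    angles = codes.astype(np.float64) / (levels - 1) * 2 * np.pi - np.pi
    pairs = np.stack([radii * np.cos(angles), radii * np.sin(angles)], axis=1)
    return q.T @ pairs.reshape(-1)                          # undo the rotation

x = np.random.default_rng(1).standard_normal(128)
codes, radii, q = polar_quantize_sketch(x)
x_hat = polar_dequantize_sketch(codes, radii, q)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Because the quantization grid is fixed, only the low-bit angle codes need to be stored per entry, which is where the memory saving described in the abstract comes from.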
 
       
      
        Related papers
        - TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering [10.427881558469442]
We introduce TaDA, a training-free recipe for KV cache compression with adaptive quantization precision. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Our method paves the way for scalable and high-performance reasoning in language models.
arXiv  Detail & Related papers  (2025-06-05T05:23:38Z) - NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics [6.048883141729117]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands.
arXiv  Detail & Related papers  (2025-05-22T04:23:19Z) - PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration [26.972039704548184]
Quantizing the KV cache to lower bit widths is an effective way to reduce computational costs. Previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge.
arXiv  Detail & Related papers  (2025-02-01T18:59:03Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
KV compression methods, including KV pruning and KV quantization, focus on either the token or the precision dimension. We show that storing more tokens in the KV cache at lower precision, i.e., quantized pruning, can significantly enhance the long-context performance of LLMs.
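
A back-of-the-envelope calculation makes this token-precision trade-off concrete. The model dimensions below are assumed purely for illustration and are not taken from the paper:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes_per_value.
def kv_cache_gib(tokens, bits, layers=32, kv_heads=8, head_dim=128):
    return 2 * layers * kv_heads * head_dim * tokens * (bits / 8) / 2**30

budget = kv_cache_gib(tokens=32_768, bits=16)            # FP16 baseline: 4 GiB
# At a fixed memory budget, halving the precision doubles the tokens kept:
print(kv_cache_gib(tokens=65_536, bits=8) == budget)     # True
print(kv_cache_gib(tokens=131_072, bits=4) == budget)    # True
```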
arXiv  Detail & Related papers  (2024-12-17T09:20:31Z) - Residual vector quantization for KV cache compression in large language model [2.3094645821058735]
KV cache compression methods have mainly relied on scalar quantization techniques to reduce the memory requirements during decoding.
In this work, we apply residual vector quantization, which has been widely used for high-fidelity audio compression, to compress the KV cache in large language models (LLMs).
The codebook is learned with an exponential moving average; there are no other learnable parameters, and in particular no input and output projections of the kind normally used in a vector quantization setup.
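
The following is a minimal sketch of the residual VQ encode/decode loop. It uses separate random codebooks with shrinking scales as stand-ins; the paper learns its codebook with an exponential moving average and applies the scheme to KV cache tensors rather than a toy vector.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stages, so the
    reconstruction is a sum of one codeword per stage."""
    residual, codes = x.copy(), []
    for cb in codebooks:                                   # cb: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes                                           # log2(num_codes) bits per stage

def rvq_decode(codes, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
dim, stages, num_codes = 8, 4, 256                         # tiny dimension for illustration
# Random codebooks with shrinking scale stand in for the EMA-learned codebook.
codebooks = [0.5 ** s * rng.standard_normal((num_codes, dim)) for s in range(stages)]
x = rng.standard_normal(dim)
x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```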
arXiv  Detail & Related papers  (2024-10-21T07:20:41Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows quickly with batch size and context length.
Existing approaches to mitigate this issue include efficient attention variants integrated in upcycling stages and KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
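
As a rough sketch of the underlying low-rank idea, the snippet below factors a hypothetical key-projection weight matrix with a truncated SVD; the paper's progressive compression strategy is not modeled here, and the matrix shape is assumed for illustration.

```python
import numpy as np

def low_rank_factor(w, rank):
    """Truncated SVD: w (d_model x d_head) ~= a @ b, with a (d_model x rank)
    and b (rank x d_head). The two thin factors replace the full matrix."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank]

rng = np.random.default_rng(0)
w_k = rng.standard_normal((4096, 128))                     # hypothetical key projection
a, b = low_rank_factor(w_k, rank=32)
print("params:", w_k.size, "->", a.size + b.size)
# Random weights have a nearly flat spectrum; the achievable error in practice
# depends on how quickly the real weight matrix's singular values decay.
print("rel. error:", np.linalg.norm(w_k - a @ b) / np.linalg.norm(w_k))
```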
arXiv  Detail & Related papers  (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
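
A highly simplified sketch of query-driven channel pruning is shown below: it scores key-cache channels by a magnitude-based interaction with recent queries and keeps the top-scoring ones. The scoring rule and keep ratio are illustrative assumptions, not the paper's exact criterion (which minimizes attention weight loss).

```python
import numpy as np

def prune_key_channels(k_cache, queries, keep_ratio=0.5):
    """k_cache: (tokens, head_dim), queries: (n_q, head_dim).
    Score each channel by summed |q_d| * |k_d| and keep the top channels."""
    scores = np.abs(queries).sum(0) * np.abs(k_cache).sum(0)   # (head_dim,)
    keep = int(k_cache.shape[1] * keep_ratio)
    idx = np.sort(np.argsort(scores)[-keep:])
    # Return the pruned cache plus the kept channel indices, which must also
    # be applied to incoming queries at attention time.
    return k_cache[:, idx], idx

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((1024, 128))
queries = rng.standard_normal((16, 128))
pruned, idx = prune_key_channels(k_cache, queries)
print(pruned.shape)   # (1024, 64)
```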
arXiv  Detail & Related papers  (2024-07-30T17:59:08Z) - QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [10.067037913589175]
Serving LLMs requires substantial memory due to the storage requirements of Key-Value embeddings in the KV cache.
Traditional quantization methods face significant memory overhead due to the need to store quantization constants.
We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss transform followed by sign-bit quantization.
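
A minimal sketch of the general recipe, a random Johnson-Lindenstrauss projection followed by sign-bit quantization, is given below. The inner-product estimator is the standard SimHash-style one, and storing the key norm is an assumption of this sketch, not necessarily how the paper's estimator works.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 256                        # embedding dim, projection dim
S = rng.standard_normal((m, d))        # JL / SimHash projection matrix

def quantize_key(k):
    return np.sign(S @ k) > 0          # m sign bits, no per-block scale or zero point

def estimate_dot(q, k_bits, k_norm):
    # The fraction of sign agreements between sign(Sq) and the stored key bits
    # estimates the angle between q and k (SimHash identity).
    agree = np.mean((np.sign(S @ q) > 0) == k_bits)
    theta = np.pi * (1.0 - agree)
    return np.linalg.norm(q) * k_norm * np.cos(theta)

k = rng.standard_normal(d)
q = rng.standard_normal(d) + 0.5 * k   # correlated query for a non-trivial angle
print("true:", q @ k,
      "estimate:", estimate_dot(q, quantize_key(k), np.linalg.norm(k)))
```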
arXiv  Detail & Related papers  (2024-06-05T17:42:05Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv  Detail & Related papers  (2024-05-21T08:35:10Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.
 Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.
Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv  Detail & Related papers  (2024-01-31T18:58:14Z) 
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     