Related papers: SQuat: Subspace-orthogonal KV Cache Quantization

URL: http://arxiv.org/abs/2503.24358v1
Date: Mon, 31 Mar 2025 17:37:32 GMT
Title: SQuat: Subspace-orthogonal KV Cache Quantization
Authors: Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
Abstract summary: We introduce SQuat (Subspace-orthogonal KV cache quantization), which reduces peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
Score: 19.131705063324883
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The key-value (KV) cache accelerates LLM decoding by storing KV tensors from previously generated tokens. It reduces redundant computation at the cost of increased memory usage. To mitigate this overhead, existing approaches compress KV tensors into lower-bit representations; however, quantization errors can accumulate as more tokens are generated, potentially resulting in undesired outputs. In this paper, we introduce SQuat (Subspace-orthogonal KV cache quantization). It first constructs a subspace spanned by query tensors to capture the most critical task-related information. During key tensor quantization, it enforces that the difference between the (de)quantized and original keys remains orthogonal to this subspace, minimizing the impact of quantization errors on the attention mechanism's outputs. SQuat requires no model fine-tuning, no additional calibration dataset for offline learning, and is grounded in a theoretical framework we develop. Through numerical experiments, we show that our method reduces peak memory by 2.17× to 2.82×, improves throughput by 2.45× to 3.60×, and achieves more favorable benchmark scores than existing KV cache quantization algorithms.
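
The orthogonality condition in the abstract can be illustrated with a small NumPy sketch. This is not the SQuat algorithm itself: the helper names (`query_subspace`, `uniform_quantize`, `remove_subspace_error`), the min-max 2-bit quantizer, and the synthetic low-rank queries are all assumptions made for illustration. It only shows why an error orthogonal to the query subspace leaves the attention logits unchanged.

```python
import numpy as np

def query_subspace(queries, rank):
    """Orthonormal basis (d x rank) for the dominant directions of the query rows."""
    # The right singular vectors of the (n x d) query matrix span its row space.
    _, _, vt = np.linalg.svd(queries, full_matrices=False)
    return vt[:rank].T

def uniform_quantize(x, bits=2):
    """Min-max uniform quantize-then-dequantize (illustrative quantizer only)."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / scale) * scale + lo

def remove_subspace_error(key, key_hat, basis):
    """Subtract the part of the quantization error that lies in span(basis)."""
    err = key_hat - key
    return key_hat - basis @ (basis.T @ err)

rng = np.random.default_rng(0)
d, n, r = 16, 8, 4
# Assumption: the queries lie exactly in an r-dimensional subspace of R^d.
queries = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))
key = rng.normal(size=d)

basis = query_subspace(queries, r)
key_hat = uniform_quantize(key)
key_fixed = remove_subspace_error(key, key_hat, basis)

# Largest change in any query-key logit, before and after the correction.
logit_err_plain = np.abs(queries @ (key_hat - key)).max()
logit_err_fixed = np.abs(queries @ (key_fixed - key)).max()
print(logit_err_plain, logit_err_fixed)
```

Note that `key_fixed` no longer sits on the quantization grid, so this sketch demonstrates the constraint rather than a storable low-bit format; according to the abstract, the paper enforces the condition during key quantization itself.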

Related papers

InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models [4.4248984733976275]
InnerQ is a hardware-aware KV-cache quantization scheme that reduces decode latency without sacrificing accuracy. It applies group-wise quantization while grouping the cache matrices over their inner dimension. Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches.
arXiv Detail & Related papers (2026-02-26T16:50:36Z)
VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization [23.781285860723248]
The Key-Value (KV) cache introduces memory overhead during large language model (LLM) inference. We propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks.
arXiv Detail & Related papers (2025-10-07T17:35:28Z)
KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
AnTKV: Anchor Token-Aware Sub-Bit Vector Quantization for KV Cache in Large Language Models [27.605195979962474]
Quantization has emerged as an effective and lightweight solution to reduce the memory footprint of the KV cache in Large Language Models. We propose AnTKV, a dual-stage framework that leverages anchor token-aware vector quantization to compress the KV cache. Experiments demonstrate that AnTKV matches or surpasses prior methods at 4-bit, and significantly reduces perplexity under ultra-low-bit quantization.
arXiv Detail & Related papers (2025-06-24T10:45:48Z)
CommVQ: Commutative Vector Quantization for KV Cache Compression [50.37946553931796]
We propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook.
arXiv Detail & Related papers (2025-06-23T17:50:11Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics [6.048883141729117]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands.
arXiv Detail & Related papers (2025-05-22T04:23:19Z)
Accurate KV Cache Quantization with Outlier Tokens Tracing [44.722738059962296]
KV Cache quantization presents a promising solution, striking a good balance between memory usage and accuracy. Previous research has shown that the Keys are distributed by channel, while the Values are distributed by token. Our method achieves significant accuracy improvements under 2-bit quantization and can deliver a 6.4 times reduction in memory usage and a 2.3 times increase in throughput.
arXiv Detail & Related papers (2025-05-16T07:23:12Z)
More for Keys, Less for Values: Adaptive KV Cache Quantization [59.708443710731146]
This paper introduces an information-aware quantization framework that adaptively compresses the key-value cache in large language models. We show that key matrices consistently exhibit higher norm values and are more sensitive to quantization than value matrices. We propose a mixed-precision quantization strategy, KV-AdaQuant, which allocates more bitwidth for keys and fewer for values.
arXiv Detail & Related papers (2025-02-20T22:24:27Z)
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [10.067037913589175]
Serving LLMs requires substantial memory due to the storage requirements of Key-Value embeddings in the KV cache. Traditional quantization methods face significant memory overhead due to the need to store quantization constants. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss transform followed by sign-bit quantization.
arXiv Detail & Related papers (2024-06-05T17:42:05Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI. KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory.
arXiv Detail & Related papers (2024-02-05T06:06:47Z)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)
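
The QJL entry above describes a Johnson-Lindenstrauss transform followed by sign-bit quantization. A minimal NumPy sketch of that idea follows, assuming a Gaussian projection matrix and the standard sign-projection identity for Gaussian vectors; the function names and the very large projection size `m` are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def qjl_encode_key(key, proj):
    """Store only the sign bits of the projected key, plus the key's norm."""
    signs = np.sign(proj @ key).astype(np.int8)
    return signs, np.linalg.norm(key)

def qjl_inner_product(query, signs, key_norm, proj):
    """Estimate <query, key> from the sign bits; the query stays unquantized."""
    m = proj.shape[0]
    # For Gaussian s: E[<s, q> * sign(<s, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    # so rescaling by sqrt(pi/2) * ||k|| gives an unbiased estimate.
    return np.sqrt(np.pi / 2) * key_norm * np.dot(proj @ query, signs) / m

rng = np.random.default_rng(0)
d, m = 16, 50_000  # m is far larger than a real cache would use (demo only)
proj = rng.normal(size=(m, d))
query = rng.normal(size=d)
key = rng.normal(size=d)

signs, key_norm = qjl_encode_key(key, proj)
estimate = qjl_inner_product(query, signs, key_norm, proj)
exact = float(query @ key)
print(exact, estimate)
```

Only one bit per projected coordinate and a single norm are kept per key, so no per-group scale or zero-point has to be stored, which is one reading of the "zero overhead" claim in that entry's title.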

This list is automatically generated from the titles and abstracts of the papers on this site.