KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction
- URL: http://arxiv.org/abs/2510.05373v1
- Date: Mon, 06 Oct 2025 21:08:11 GMT
- Title: KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction
- Authors: Utkarsh Saxena, Kaushik Roy
- Abstract summary: We propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters. We show that KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression.
- Score: 8.486713415198968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Quantizing the key-value (KV) cache is a promising strategy for improving the inference efficiency of large language models (LLMs). However, aggressive quantization to very low precision (e.g., 2 bits) introduces significant errors in the stored key and value tensors, which propagate through the dot-product attention mechanism and ultimately degrade generation quality. To address this, we propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization in the extreme low-precision regime. KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters that explicitly compensate for errors introduced by quantized keys. Across extensive evaluations on the LLaMA, Qwen2.5, and Qwen3 model families, KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression. Furthermore, we implement a custom attention kernel that delivers up to 2.55x faster inference than the FlashAttention baseline, enabling efficient long-context LLM inference.
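The abstract names two mechanisms: an orthogonal Hadamard rotation applied around value quantization, and lightweight linear adapters that correct for errors from quantized keys. The snippet below is a minimal sketch of how those pieces could compose in a single attention step; the 2-bit per-token quantizer, the adapter acting on dequantized keys, and the weight `W_corr` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of Hadamard-rotated value quantization
# plus a linear key-correction adapter, as described in the KVLinC abstract.
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal Hadamard matrix via Sylvester construction (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], 1), torch.cat([H, -H], 1)], 0)
    return H / H.shape[0] ** 0.5

def quant_dequant_2bit(x: torch.Tensor) -> torch.Tensor:
    """Illustrative per-token symmetric 2-bit quantization (integer levels -2..1)."""
    scale = x.abs().amax(dim=-1, keepdim=True) / 2 + 1e-8
    return torch.clamp(torch.round(x / scale), -2, 1) * scale

d = 64                       # head dimension
q = torch.randn(1, d)        # current query
K = torch.randn(128, d)      # cached keys (stored quantized in practice)
V = torch.randn(128, d)      # cached values

H = hadamard(d)
V_hat = quant_dequant_2bit(V @ H) @ H.T           # rotate -> quantize -> rotate back
K_hat = quant_dequant_2bit(K)                     # 2-bit keys
W_corr = torch.eye(d) + 0.01 * torch.randn(d, d)  # hypothetical learned correction adapter
K_corr = K_hat @ W_corr                           # linear correction of dequantized keys

attn = torch.softmax(q @ K_corr.T / d ** 0.5, dim=-1)
out = attn @ V_hat                                # attention output with corrected KV cache
```

Because the Hadamard matrix is orthogonal, rotating values before quantization and un-rotating afterwards changes nothing at full precision, but it spreads outliers across channels so a 2-bit quantizer loses less information; the key-side adapter then compensates for the remaining attention-logit error.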
Related papers
- SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference [71.20542521694524]
SmallKV is a small-model-assisted compensation method for KV cache compression. We show that SmallKV achieves 1.75-2.56 times higher throughput than baseline methods.
arXiv Detail & Related papers (2025-08-03T09:15:36Z) - KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache [13.662270631753135]
Quantization can effectively alleviate the memory pressure caused by the KV cache. We propose KVmix, a novel mixed-precision quantization method for the KV cache.
arXiv Detail & Related papers (2025-05-18T07:04:53Z) - KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs). However, the memory footprint of the KV cache is a major bottleneck for model deployment, directly limiting the model's batch size. We propose KVCrush, which can be combined with many KV compression techniques to improve model accuracy at a much smaller memory footprint.
arXiv Detail & Related papers (2025-02-24T02:57:51Z) - Quantize What Counts: Bit Allocation Insights Informed by Spectral Gaps in Keys and Values [57.54443445583921]
We provide two novel theorems aimed at enhancing KV quantization methods. Our first theorem, termed Key-Value Norm Disparity, states that the key weight matrices by nature carry richer information than the value weight matrices. Our second theorem, Key-Driven Quantization, posits that prioritizing the quantization precision of keys over values yields significant improvements in overall quantization performance. (A toy harness for probing this bit-allocation question is sketched after this related-papers list.)
arXiv Detail & Related papers (2025-02-20T22:24:27Z) - KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference [40.97781175723418]
KV cache quantization can improve large language model inference throughput and latency in long contexts. Current methods have three unsolved issues: they overlook layer-wise sensitivity to KV cache quantization, incur high overhead from online fine-grained decision-making, and offer low flexibility across different LLMs and constraints. We propose KVTuner, a simple yet effective framework that adaptively searches for optimal, hardware-friendly, layer-wise KV quantization precision pairs for coarse-grained KV cache.
arXiv Detail & Related papers (2025-02-06T15:26:26Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. Mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token dimension or the precision dimension in isolation. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
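Several of the entries above, notably "Quantize What Counts," argue that precision spent on keys matters more than precision spent on values. The toy harness below is not taken from any of the listed papers; the per-token quantizer and the random placeholder tensors are assumptions. It simply measures how two bit allocations with the same total budget perturb the attention output, so the key-prioritized claim can be checked on real cached activations.

```python
# Toy harness for the key-vs-value bit-allocation question: same total bit budget,
# different split between keys and values, compared against full-precision attention.
import torch

def quant_dequant(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Per-token symmetric uniform quantization to the given bit width (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax + 1e-8
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def attn_out(q, K, V):
    """Single-head dot-product attention output."""
    return torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1) @ V

d, n = 64, 256
q = torch.randn(1, d)
K, V = torch.randn(n, d), torch.randn(n, d)   # placeholders; use real cached K/V activations
ref = attn_out(q, K, V)

for k_bits, v_bits in [(4, 2), (2, 4)]:       # equal total budget, different key/value split
    err = (attn_out(q, quant_dequant(K, k_bits), quant_dequant(V, v_bits)) - ref).norm()
    print(f"keys {k_bits}-bit / values {v_bits}-bit -> output error {err.item():.4f}")
```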