MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
- URL: http://arxiv.org/abs/2512.19206v1
- Date: Mon, 22 Dec 2025 09:44:26 GMT
- Title: MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
- Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
- Abstract summary: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs). Existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. We propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels.
- Score: 30.527521568636242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization to the value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
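To make the two-factor insight concrete, here is a minimal sketch of query-aware mixed-precision key quantization. The scoring rule (per-channel dynamic range as "difficulty" times the query's channel magnitude as "relevance"), the 10% keep ratio, and the 2-bit base precision are illustrative assumptions, not MixKVQ's exact algorithm.

```python
# Hedged sketch of query-aware mixed-precision KV quantization in the spirit
# of MixKVQ; the channel-scoring rule and budget below are assumptions.
import numpy as np

def quantize(x, bits, axis):
    """Uniform asymmetric quantization along `axis` (per-channel or per-token)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo  # dequantized, for clarity

def mixkvq_like(keys, values, query, keep_ratio=0.1, low_bits=2):
    # keys/values: (seq_len, d), query: (d,)
    difficulty = keys.max(0) - keys.min(0)        # per-channel dynamic range
    relevance = np.abs(query)                     # per-channel query magnitude
    score = difficulty * relevance
    k = max(1, int(keep_ratio * keys.shape[1]))
    hi_prec = np.argsort(score)[-k:]              # channels kept in full precision
    k_hat = quantize(keys, low_bits, axis=0)      # per-channel low-bit keys
    k_hat[:, hi_prec] = keys[:, hi_prec]          # preserve critical channels
    v_hat = quantize(values, low_bits, axis=1)    # per-token value quantization
    return k_hat, v_hat

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(128, 64)), rng.normal(size=(128, 64)), rng.normal(size=64)
K_hat, V_hat = mixkvq_like(K, V, q)
```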
Related papers
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
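As a rough illustration of gating-based eviction, the sketch below scores each cached token with a sigmoid gate and keeps the top 30%. The gate weights here are random placeholders; the paper learns its gating module, which is not reproduced.

```python
# Illustrative gated KV eviction; gate_w is a stand-in for learned weights.
import numpy as np

def gated_evict(keys, values, gate_w, keep_frac=0.3):
    # keys/values: (seq_len, d); gate_w: (d,) assumed gate parameters
    scores = 1.0 / (1.0 + np.exp(-(keys @ gate_w)))  # sigmoid gate per token
    k = max(1, int(keep_frac * len(keys)))
    keep = np.sort(np.argsort(scores)[-k:])          # survivors, in temporal order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
K_small, V_small = gated_evict(K, V, gate_w=rng.normal(size=64))  # evicts ~70%
```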
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z) - TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering [10.427881558469442]
We introduce TaDA, a training-free recipe for KV cache compression with adaptive quantization precision. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Our method paves the way for scalable and high-performance reasoning in language models.
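Mean-centering is the one concrete ingredient named in the title; a minimal sketch follows, assuming uniform symmetric quantization of the centered residual with the per-channel mean kept in full precision (TaDA's adaptive precision selection is omitted).

```python
# Mean-centered quantization: subtracting the per-channel mean shrinks the
# range the low-bit grid must cover; the mean is added back at dequantization.
import numpy as np

def mean_centered_quant(x, bits=4):
    mu = x.mean(axis=0, keepdims=True)            # per-channel mean (kept in fp32)
    r = x - mu
    scale = np.abs(r).max(axis=0, keepdims=True) / (2**(bits - 1) - 1) + 1e-8
    q = np.clip(np.round(r / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q * scale + mu                         # dequantized reconstruction

x = np.random.default_rng(0).normal(loc=5.0, size=(128, 64))
err = np.abs(mean_centered_quant(x) - x).mean()
```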
arXiv Detail & Related papers (2025-06-05T05:23:38Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
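A hedged sketch of the head-grouping idea: heads with similar key statistics are paired and compressed jointly with a truncated SVD. The mean-signature similarity and greedy pairing below are illustrative stand-ins for the paper's clustering, and the OVC calibration step is not reproduced.

```python
# Group similar heads, then low-rank-compress each group jointly.
import numpy as np

def low_rank(x, rank):
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def grouped_compress(head_keys, rank=8):
    # head_keys: (num_heads, seq_len, d); num_heads assumed even here
    sig = head_keys.mean(axis=1)                          # per-head signature
    sig = sig / np.linalg.norm(sig, axis=1, keepdims=True)
    order = np.argsort(sig @ sig[0])[::-1]                # sort by similarity to head 0
    out = np.empty_like(head_keys)
    for a, b in zip(order[0::2], order[1::2]):            # compress similar pairs jointly
        stacked = np.concatenate([head_keys[a], head_keys[b]], axis=1)
        approx = low_rank(stacked, rank)
        out[a], out[b] = approx[:, :head_keys.shape[2]], approx[:, head_keys.shape[2]:]
    return out

H_hat = grouped_compress(np.random.default_rng(0).normal(size=(8, 64, 32)))
```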
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference [24.184349246524587]
Cocktail employs chunk-adaptive mixed-precision quantization to optimize the KV cache. A chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks. Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
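A simplified sketch of chunk-adaptive bitwidth assignment, assuming a saliency heuristic (mean |key·query| per chunk) and a {2, 4, 8}-bit menu in place of the paper's search procedure.

```python
# Assign a per-chunk bitwidth from a saliency score; illustrative only.
import numpy as np

def quant(x, bits):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo

def chunk_adaptive(keys, query, chunk=32):
    out = keys.copy()
    starts = range(0, len(keys), chunk)
    sal = [np.abs(keys[s:s+chunk] @ query).mean() for s in starts]  # chunk saliency
    thresh_hi, thresh_lo = np.quantile(sal, [0.9, 0.5])
    for i, s in enumerate(starts):
        bits = 8 if sal[i] >= thresh_hi else (4 if sal[i] >= thresh_lo else 2)
        out[s:s+chunk] = quant(keys[s:s+chunk], bits)
    return out

rng = np.random.default_rng(0)
K_hat = chunk_adaptive(rng.normal(size=(256, 64)), rng.normal(size=64))
```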
arXiv Detail & Related papers (2025-03-30T03:20:34Z) - SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention [0.0]
Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the key cache.
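A hedged sketch of the SVD-based idea: after rotating keys into the singular basis, energy concentrates in the leading latent channels, which receive more bits. The 8/4/2-bit allocation by singular-value rank is an assumption for illustration.

```python
# Mixed-precision quantization in the singular basis of the key cache.
import numpy as np

def quant_sym(x, bits):
    scale = np.abs(x).max(axis=0, keepdims=True) / (2**(bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def svdq_like(keys, n_hi=8, n_mid=24):
    u, s, vt = np.linalg.svd(keys, full_matrices=False)
    z = keys @ vt.T                                 # project into singular basis
    z[:, :n_hi] = quant_sym(z[:, :n_hi], 8)         # leading channels: 8-bit
    z[:, n_hi:n_hi+n_mid] = quant_sym(z[:, n_hi:n_hi+n_mid], 4)
    z[:, n_hi+n_mid:] = quant_sym(z[:, n_hi+n_mid:], 2)  # tail: 2-bit
    return z @ vt                                   # rotate back for attention

K_hat = svdq_like(np.random.default_rng(0).normal(size=(256, 64)))
```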
arXiv Detail & Related papers (2025-02-21T08:55:21Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
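A toy calculation of the trade-off under study: with a fixed memory budget, lowering precision buys proportionally more cached tokens. The head dimension and budget below are illustrative, not from the paper.

```python
# KV memory per token = 2 (K and V) * d channels * bits / 8 bytes.
bytes_per_token = lambda d, bits: 2 * d * bits / 8

d, budget_mb = 128, 64
for bits in (16, 8, 4, 2):
    tokens = int(budget_mb * 2**20 / bytes_per_token(d, bits))
    print(f"{bits:>2}-bit cache fits {tokens:,} tokens in {budget_mb} MB")
```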
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-Value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
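A hedged sketch of the decompose-then-quantize pattern: factor the cache matrix, keep the small factor in full precision, and push the large factor to low bits. Using a plain truncated SVD here instead of DecoQuant's tensor decomposition is an assumption for illustration.

```python
# Low-bit quantization of the large factor of a matrix decomposition.
import numpy as np

def quant_sym(x, bits=2):
    scale = np.abs(x).max() / (2**(bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def decoquant_like(x, rank=16, bits=2):
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    big = u[:, :rank]                      # (seq_len, rank): large, low-bit
    small = np.diag(s[:rank]) @ vt[:rank]  # (rank, d): small, full precision
    return quant_sym(big, bits) @ small

X_hat = decoquant_like(np.random.default_rng(0).normal(size=(512, 64)))
```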
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [34.824534775022144]
We propose Coupled Quantization (CQ) as a technique for KV cache compression.
CQ couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner.
We demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
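A rough sketch of the coupling idea: adjacent channels are grouped into small vectors and jointly encoded against a shared codebook, so a 4-bit index over a pair of channels costs 2 bits per channel. The pair grouping and the tiny k-means codebook are illustrative choices, not the paper's exact construction.

```python
# Vector-quantize coupled channel pairs with a shared k-means codebook.
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        a = np.argmin(((x[:, None] - c[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (a == j).any():
                c[j] = x[a == j].mean(0)
    return c, a

def coupled_quant(keys, group=2, codebook_bits=4):
    seq, d = keys.shape
    vecs = keys.reshape(seq * d // group, group)  # couple adjacent channels
    c, a = kmeans(vecs, 2**codebook_bits)         # 4-bit index / 2 channels = 2 bits each
    return c[a].reshape(seq, d)

K_hat = coupled_quant(np.random.default_rng(0).normal(size=(128, 64)))
```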
arXiv Detail & Related papers (2024-05-07T00:25:20Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
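A minimal sketch of quality-adaptive per-token precision, assuming an attention-magnitude proxy decides which tokens keep 8 bits versus 2 bits; QAQ's actual quality signal and bit policy may differ.

```python
# Per-token bitwidth chosen from an attention-magnitude proxy.
import numpy as np

def quant_tok(x, bits):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo

def qaq_like(keys, query, hi_frac=0.2):
    attn = np.abs(keys @ query)                   # attention proxy per token
    cut = np.quantile(attn, 1 - hi_frac)
    return np.stack([quant_tok(k, 8 if a >= cut else 2)
                     for k, a in zip(keys, attn)])

rng = np.random.default_rng(0)
K_hat = qaq_like(rng.normal(size=(256, 64)), rng.normal(size=64))
```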
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
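A hedged sketch of the two ingredients: a one-pass distance-threshold clustering stands in for the paper's online clustering of key tokens, and values are kept by sampling with probability proportional to their $\ell_2$ norm.

```python
# Streaming key clustering plus norm-proportional value sampling.
import numpy as np

def online_cluster_keys(keys, radius=10.0):
    centers = []
    for k in keys:                                   # one pass over the stream
        if not centers or min(np.linalg.norm(k - c) for c in centers) > radius:
            centers.append(k)                        # open a new cluster
    return np.stack(centers)

def l2_sample_values(values, m, seed=0):
    p = np.linalg.norm(values, axis=1)
    p = p / p.sum()
    idx = np.random.default_rng(seed).choice(len(values), size=m, replace=False, p=p)
    return values[np.sort(idx)]

rng = np.random.default_rng(0)
K_small = online_cluster_keys(rng.normal(size=(256, 64)))
V_small = l2_sample_values(rng.normal(size=(256, 64)), m=64)
```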
arXiv Detail & Related papers (2024-02-08T22:17:40Z)