MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
- URL: http://arxiv.org/abs/2512.19206v1
- Date: Mon, 22 Dec 2025 09:44:26 GMT
- Title: MixKVQ: Query-Aware Mixed-Precision KV Cache Quantization for Long-Context Reasoning
- Authors: Tao Zhang, Ziqian Zeng, Hao Peng, Huiping Zhuang, Cen Chen
- Abstract summary: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs). Existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. We propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels.
- Score: 30.527521568636242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long Chain-of-Thought (CoT) reasoning has significantly advanced the capabilities of Large Language Models (LLMs), but this progress is accompanied by substantial memory and latency overhead from the extensive Key-Value (KV) cache. Although KV cache quantization is a promising compression technique, existing low-bit quantization methods often exhibit severe performance degradation on complex reasoning tasks. Fixed-precision quantization struggles to handle outlier channels in the key cache, while current mixed-precision strategies fail to accurately identify components requiring high-precision representation. We find that an effective low-bit KV cache quantization strategy must consider two factors: a key channel's intrinsic quantization difficulty and its relevance to the query. Based on this insight, we propose MixKVQ, a novel plug-and-play method that introduces a lightweight, query-aware algorithm to identify and preserve critical key channels that need higher precision, while applying per-token quantization to the value cache. Experiments on complex reasoning datasets demonstrate that our approach significantly outperforms existing low-bit methods, achieving performance comparable to a full-precision baseline at a substantially reduced memory footprint.
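To make the two-factor insight concrete, here is a minimal sketch of query-aware mixed-precision key quantization. The scoring rule (per-channel dynamic range as "difficulty" times the query's channel magnitude as "relevance"), the 10% keep ratio, and the 2-bit base precision are illustrative assumptions, not MixKVQ's exact algorithm.

```python
# Hedged sketch of query-aware mixed-precision KV quantization in the spirit
# of MixKVQ; the channel-scoring rule and budget below are assumptions.
import numpy as np

def quantize(x, bits, axis):
    """Uniform asymmetric quantization along `axis` (per-channel or per-token)."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo  # dequantized, for clarity

def mixkvq_like(keys, values, query, keep_ratio=0.1, low_bits=2):
    # keys/values: (seq_len, d), query: (d,)
    difficulty = keys.max(0) - keys.min(0)        # per-channel dynamic range
    relevance = np.abs(query)                     # per-channel query magnitude
    score = difficulty * relevance
    k = max(1, int(keep_ratio * keys.shape[1]))
    hi_prec = np.argsort(score)[-k:]              # channels kept in full precision
    k_hat = quantize(keys, low_bits, axis=0)      # per-channel low-bit keys
    k_hat[:, hi_prec] = keys[:, hi_prec]          # preserve critical channels
    v_hat = quantize(values, low_bits, axis=1)    # per-token value quantization
    return k_hat, v_hat

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(128, 64)), rng.normal(size=(128, 64)), rng.normal(size=64)
K_hat, V_hat = mixkvq_like(K, V, q)
```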
Related papers
- Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models. Our approach integrates seamlessly into both the prefill and decoding stages. Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
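As a rough illustration of gating-based eviction, the sketch below scores each cached token with a sigmoid gate and keeps the top 30%. The gate weights here are random placeholders; the paper learns its gating module, which is not reproduced.

```python
# Illustrative gated KV eviction; gate_w is a stand-in for learned weights.
import numpy as np

def gated_evict(keys, values, gate_w, keep_frac=0.3):
    # keys/values: (seq_len, d); gate_w: (d,) assumed gate parameters
    scores = 1.0 / (1.0 + np.exp(-(keys @ gate_w)))  # sigmoid gate per token
    k = max(1, int(keep_frac * len(keys)))
    keep = np.sort(np.argsort(scores)[-k:])          # survivors, in temporal order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
K_small, V_small = gated_evict(K, V, gate_w=rng.normal(size=64))  # evicts ~70%
```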
arXiv Detail & Related papers (2026-01-25T03:07:54Z) - XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression [54.28208936996186]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization.
arXiv Detail & Related papers (2025-10-13T10:17:21Z) - TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering [10.427881558469442]
We introduce TaDA, a training-free recipe for KV cache compression with adaptive quantization precision. Our approach yields substantial accuracy improvements for multiple models supporting various context lengths. Our method paves the way for scalable and high-performance reasoning in language models.
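Mean-centering is the one concrete ingredient named in the title; a minimal sketch follows, assuming uniform symmetric quantization of the centered residual with the per-channel mean kept in full precision (TaDA's adaptive precision selection is omitted).

```python
# Mean-centered quantization: subtracting the per-channel mean shrinks the
# range the low-bit grid must cover; the mean is added back at dequantization.
import numpy as np

def mean_centered_quant(x, bits=4):
    mu = x.mean(axis=0, keepdims=True)            # per-channel mean (kept in fp32)
    r = x - mu
    scale = np.abs(r).max(axis=0, keepdims=True) / (2**(bits - 1) - 1) + 1e-8
    q = np.clip(np.round(r / scale), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return q * scale + mu                         # dequantized reconstruction

x = np.random.default_rng(0).normal(loc=5.0, size=(128, 64))
err = np.abs(mean_centered_quant(x) - x).mean()
```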
arXiv Detail & Related papers (2025-06-05T05:23:38Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
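A hedged sketch of the head-grouping idea: heads with similar key statistics are paired and compressed jointly with a truncated SVD. The mean-signature similarity and greedy pairing below are illustrative stand-ins for the paper's clustering, and the OVC calibration step is not reproduced.

```python
# Group similar heads, then low-rank-compress each group jointly.
import numpy as np

def low_rank(x, rank):
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def grouped_compress(head_keys, rank=8):
    # head_keys: (num_heads, seq_len, d); num_heads assumed even here
    sig = head_keys.mean(axis=1)                          # per-head signature
    sig = sig / np.linalg.norm(sig, axis=1, keepdims=True)
    order = np.argsort(sig @ sig[0])[::-1]                # sort by similarity to head 0
    out = np.empty_like(head_keys)
    for a, b in zip(order[0::2], order[1::2]):            # compress similar pairs jointly
        stacked = np.concatenate([head_keys[a], head_keys[b]], axis=1)
        approx = low_rank(stacked, rank)
        out[a], out[b] = approx[:, :head_keys.shape[2]], approx[:, head_keys.shape[2]:]
    return out

H_hat = grouped_compress(np.random.default_rng(0).normal(size=(8, 64, 32)))
```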
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference [24.184349246524587]
Cocktail employs chunk-adaptive mixed-precision quantization to optimize the KV cache. A chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks. Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets.
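A simplified sketch of chunk-adaptive bitwidth assignment, assuming a saliency heuristic (mean |key·query| per chunk) and a {2, 4, 8}-bit menu in place of the paper's search procedure.

```python
# Assign a per-chunk bitwidth from a saliency score; illustrative only.
import numpy as np

def quant(x, bits):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo

def chunk_adaptive(keys, query, chunk=32):
    out = keys.copy()
    starts = range(0, len(keys), chunk)
    sal = [np.abs(keys[s:s+chunk] @ query).mean() for s in starts]  # chunk saliency
    thresh_hi, thresh_lo = np.quantile(sal, [0.9, 0.5])
    for i, s in enumerate(starts):
        bits = 8 if sal[i] >= thresh_hi else (4 if sal[i] >= thresh_lo else 2)
        out[s:s+chunk] = quant(keys[s:s+chunk], bits)
    return out

rng = np.random.default_rng(0)
K_hat = chunk_adaptive(rng.normal(size=(256, 64)), rng.normal(size=64))
```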
arXiv Detail & Related papers (2025-03-30T03:20:34Z) - SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention [0.0]
Three main types of KV cache compression techniques, namely sparsity, channel compression, and quantization, have been identified. This study presents SVDq, a Singular Value Decomposition (SVD)-based mixed-precision quantization method for the key cache.
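A hedged sketch of the SVD-based idea: after rotating keys into the singular basis, energy concentrates in the leading latent channels, which receive more bits. The 8/4/2-bit allocation by singular-value rank is an assumption for illustration.

```python
# Mixed-precision quantization in the singular basis of the key cache.
import numpy as np

def quant_sym(x, bits):
    scale = np.abs(x).max(axis=0, keepdims=True) / (2**(bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def svdq_like(keys, n_hi=8, n_mid=24):
    u, s, vt = np.linalg.svd(keys, full_matrices=False)
    z = keys @ vt.T                                 # project into singular basis
    z[:, :n_hi] = quant_sym(z[:, :n_hi], 8)         # leading channels: 8-bit
    z[:, n_hi:n_hi+n_mid] = quant_sym(z[:, n_hi:n_hi+n_mid], 4)
    z[:, n_hi+n_mid:] = quant_sym(z[:, n_hi+n_mid:], 2)  # tail: 2-bit
    return z @ vt                                   # rotate back for attention

K_hat = svdq_like(np.random.default_rng(0).normal(size=(256, 64)))
```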
arXiv Detail & Related papers (2025-02-21T08:55:21Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of the KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either the token or the precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
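A toy calculation of the trade-off under study: with a fixed memory budget, lowering precision buys proportionally more cached tokens. The head dimension and budget below are illustrative, not from the paper.

```python
# KV memory per token = 2 (K and V) * d channels * bits / 8 bytes.
bytes_per_token = lambda d, bits: 2 * d * bits / 8

d, budget_mb = 128, 64
for bits in (16, 8, 4, 2):
    tokens = int(budget_mb * 2**20 / bytes_per_token(d, bits))
    print(f"{bits:>2}-bit cache fits {tokens:,} tokens in {budget_mb} MB")
```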
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-Value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
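A hedged sketch of the decompose-then-quantize pattern: factor the cache matrix, keep the small factor in full precision, and push the large factor to low bits. Using a plain truncated SVD here instead of DecoQuant's tensor decomposition is an assumption for illustration.

```python
# Low-bit quantization of the large factor of a matrix decomposition.
import numpy as np

def quant_sym(x, bits=2):
    scale = np.abs(x).max() / (2**(bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def decoquant_like(x, rank=16, bits=2):
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    big = u[:, :rank]                      # (seq_len, rank): large, low-bit
    small = np.diag(s[:rank]) @ vt[:rank]  # (rank, d): small, full precision
    return quant_sym(big, bits) @ small

X_hat = decoquant_like(np.random.default_rng(0).normal(size=(512, 64)))
```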
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization [34.824534775022144]
We propose Coupled Quantization (CQ) as a technique for KV cache compression.
CQ couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner.
We demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
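A rough sketch of the coupling idea: adjacent channels are grouped into small vectors and jointly encoded against a shared codebook, so a 4-bit index over a pair of channels costs 2 bits per channel. The pair grouping and the tiny k-means codebook are illustrative choices, not the paper's exact construction.

```python
# Vector-quantize coupled channel pairs with a shared k-means codebook.
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        a = np.argmin(((x[:, None] - c[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (a == j).any():
                c[j] = x[a == j].mean(0)
    return c, a

def coupled_quant(keys, group=2, codebook_bits=4):
    seq, d = keys.shape
    vecs = keys.reshape(seq * d // group, group)  # couple adjacent channels
    c, a = kmeans(vecs, 2**codebook_bits)         # 4-bit index / 2 channels = 2 bits each
    return c[a].reshape(seq, d)

K_hat = coupled_quant(np.random.default_rng(0).normal(size=(128, 64)))
```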
arXiv Detail & Related papers (2024-05-07T00:25:20Z) - QAQ: Quality Adaptive Quantization for LLM KV Cache [3.163526369095745]
A bottleneck in model deployment emerges due to the linear expansion of the Key-Value cache with the context length.
We propose QAQ, a Quality Adaptive Quantization scheme for the KV cache.
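A minimal sketch of quality-adaptive per-token precision, assuming an attention-magnitude proxy decides which tokens keep 8 bits versus 2 bits; QAQ's actual quality signal and bit policy may differ.

```python
# Per-token bitwidth chosen from an attention-magnitude proxy.
import numpy as np

def quant_tok(x, bits):
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / (2**bits - 1) + 1e-8
    return np.round((x - lo) / scale) * scale + lo

def qaq_like(keys, query, hi_frac=0.2):
    attn = np.abs(keys @ query)                   # attention proxy per token
    cut = np.quantile(attn, 1 - hi_frac)
    return np.stack([quant_tok(k, 8 if a >= cut else 2)
                     for k, a in zip(keys, attn)])

rng = np.random.default_rng(0)
K_hat = qaq_like(rng.normal(size=(256, 64)), rng.normal(size=64))
```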
arXiv Detail & Related papers (2024-03-07T16:42:37Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
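A hedged sketch of the two ingredients: a one-pass distance-threshold clustering stands in for the paper's online clustering of key tokens, and values are kept by sampling with probability proportional to their $\ell_2$ norm.

```python
# Streaming key clustering plus norm-proportional value sampling.
import numpy as np

def online_cluster_keys(keys, radius=10.0):
    centers = []
    for k in keys:                                   # one pass over the stream
        if not centers or min(np.linalg.norm(k - c) for c in centers) > radius:
            centers.append(k)                        # open a new cluster
    return np.stack(centers)

def l2_sample_values(values, m, seed=0):
    p = np.linalg.norm(values, axis=1)
    p = p / p.sum()
    idx = np.random.default_rng(seed).choice(len(values), size=m, replace=False, p=p)
    return values[np.sort(idx)]

rng = np.random.default_rng(0)
K_small = online_cluster_keys(rng.normal(size=(256, 64)))
V_small = l2_sample_values(rng.normal(size=(256, 64)), m=64)
```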
arXiv Detail & Related papers (2024-02-08T22:17:40Z)