PatternKV: Flattening KV Representation Expands Quantization Headroom
- URL: http://arxiv.org/abs/2510.05176v1
- Date: Sun, 05 Oct 2025 12:09:14 GMT
- Title: PatternKV: Flattening KV Representation Expands Quantization Headroom
- Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li,
- Abstract summary: KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference.<n> KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness.<n>We show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities.
- Score: 37.83913102876393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
Related papers
- KVLinC : KV Cache Quantization with Hadamard Rotation and Linear Correction [8.486713415198968]
We propose KVLinC, a framework to mitigate attention errors introduced by KV cache quantization.<n> KVLinC combines a Hadamard rotation, which reduces quantization error in values, with lightweight linear correction adapters.<n>We show that KVLinC consistently matches or surpasses strong baselines while achieving higher KV-cache compression.
arXiv Detail & Related papers (2025-10-06T21:08:11Z) - R-KV: Redundancy-aware KV Cache Compression for Reasoning Models [77.84539432982307]
We propose Redundancy-aware KV Cache Compression for Reasoning models (R-KV)<n>R-KV preserves nearly 100% of the full KV cache performance using only 10% of the KV cache.<n>Remarkably, R-KV even achieves 105% of full KV cache performance with 16% of the KV cache.
arXiv Detail & Related papers (2025-05-30T02:03:24Z) - NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics [6.048883141729117]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks.<n>LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands.
arXiv Detail & Related papers (2025-05-22T04:23:19Z) - KVCrush: Key value cache size-reduction using similarity in head-behaviour [40.792661186062396]
Key-value (KV) caching has emerged as a crucial optimization technique for accelerating inference in large language models (LLMs)<n>However, the memory footprint of the KV is a huge bottleneck for model deployment directly impacting the model's batch size.<n>We propose KVCrush which can be combined with many KV compression technologies to improve the model accuracy at a much smaller memory.
arXiv Detail & Related papers (2025-02-24T02:57:51Z) - More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference.<n>The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately.<n>In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - No Token Left Behind: Reliable KV Cache Compression via Importance-Aware
Mixed Precision Quantization [31.806112535762367]
Key-Value (KV) Caching has become an essential technique for accelerating the inference speed and throughput of generative Large Language Models(LLMs)
arXiv Detail & Related papers (2024-02-28T06:34:54Z) - KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization [67.74400574357472]
LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference.<n> Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision.<n>Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods.
arXiv Detail & Related papers (2024-01-31T18:58:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.