Related papers: KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

URL: http://arxiv.org/abs/2512.05916v1
Date: Fri, 05 Dec 2025 17:51:10 GMT
Title: KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Authors: Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau,
Abstract summary: Key-Value cache is central to the efficiency of large language models.<n>As sequence length and batch size grow, the cache becomes a major memory bottleneck.<n>We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix.
Score: 6.542188603141656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and Mistral models demonstrate that our approach consistently delivers superior projection quality.

Related papers

Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.<n>Our approach integrates seamlessly into both the prefill and decoding stages.<n>Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
Value-Guided KV Compression for LLMs via Approximated CUR Decomposition [24.262712463465665]
CurDKV is a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition.<n>Our approach approximates the dominant subspace of the attention output $softmax(QKT)V$, ensuring that the retained tokens best preserve the model's predictive behavior.
arXiv Detail & Related papers (2025-09-18T15:04:06Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values.<n>For Keys, we propose Similarity aware Recontext (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation.<n>For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
AttentionPredictor: Temporal Patterns Matter for KV Cache Compression [64.75459635661562]
We propose AttentionPredictor, which is the first learning-based method to directly predict attention patterns for KV cache compression and critical token identification.<n> AttentionPredictor accurately predicts the attention score and shares the unified prediction model, which consumes negligible memory.<n>By retaining most of the attention information, AttentionPredictor achieves 13$times$ KV cache compression and 5.6$times$ speedup in a cache offloading scenario.
arXiv Detail & Related papers (2025-02-06T13:41:46Z)
MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection [14.073722038551125]
KV cache has become a de facto technique for the inference of large language models.<n>This paper uses low-rank projection matrices to transform the cache features into spaces with reduced dimensions.<n>We find that our method can sustain over 90% performance with an average KV cache compression rate of 60%.
arXiv Detail & Related papers (2024-10-16T08:34:51Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression [13.981807478365452]
Existing approaches to reduce the Key-Value cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We find a clear correlation between the $L$ and the attention scores over cached KV pairs, where a low $L$ of a key embedding leads to a high attention score during decoding. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
arXiv Detail & Related papers (2024-06-17T11:35:16Z)
Effectively Compress KV Heads for LLM [28.0801697946958]
We propose a novel approach for compressing Key-Value ( KV) caches. Our method can compress half or even three-quarters of KV heads while maintaining performance comparable to the original LLMs.
arXiv Detail & Related papers (2024-06-11T08:37:33Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value( KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce textbfDecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.