Related papers: Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

URL: http://arxiv.org/abs/2601.21686v1
Date: Thu, 29 Jan 2026 13:19:24 GMT
Title: Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello,
Abstract summary: StiefAttention is a KV-cache compression method that learns emphorthonormal projection bases by directly minimizing output reconstruction error.<n>It outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4%$ on 0-shot MMLU accuracy at iso-compression, lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
Score: 7.162701793686856
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Key--value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrixes to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns \emph{orthonormal} projection bases by directly minimizing \emph{decoder-layer output reconstruction error}. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Noteworthy, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $11.9$ points on C4 perplexity and $5.4\%$ on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.

Related papers

Zero Sum SVD: Balancing Loss Sensitivity for Low Rank LLM Compression [11.908793753919745]
We propose textbfZero Sum SVD (textbfZS-SVD), a post-training method that performs singular component selection in whitened coordinates.<n>textbfZS-SVD prunes components across the whole model with a textbfzero sum rule that keeps the cumulative predicted loss change near zero.<n>Experiments show consistent gains across diverse benchmarks and compression ratios.
arXiv Detail & Related papers (2026-02-02T21:51:01Z)
KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity [6.542188603141656]
Key-Value cache is central to the efficiency of large language models.<n>As sequence length and batch size grow, the cache becomes a major memory bottleneck.<n>We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix.
arXiv Detail & Related papers (2025-12-05T17:51:10Z)
The Lossy Horizon: Error-Bounded Predictive Coding for Lossy Text Compression (Episode I) [6.453417258264177]
This paper introduces Error-Bounded Predictive Coding ( EPC), a lossy text that leverages a Masked Language Model (MLM) as a decompressor.<n>Instead of storing a subset of original tokens, EPC allows the model to predict masked content and stores minimal, rank-based corrections only when the model's top prediction is incorrect.<n>We demonstrate that EPC consistently dominates Predictive Masking, offering superior fidelity at a significantly lower bit rate by more efficiently utilising the model's intrinsic knowledge.
arXiv Detail & Related papers (2025-10-25T08:18:31Z)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule [54.37983890753086]
We introduce OjaKV, a framework that integrates a strategic hybrid storage policy with online subspace adaptation.<n>OjaKV preserves crucial first and most recent tokens in full-rank, maintaining high-fidelity anchors for attention.<n>It applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis.
arXiv Detail & Related papers (2025-09-25T21:42:27Z)
KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI.<n>Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck.<n>By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values.<n>For Keys, we propose Similarity aware Recontext (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation.<n>For Values, we propose Offline Head-wise Value (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.<n>In these scenarios, the Key-Value ( KV) cache is the primary bottleneck in terms of both GPU memory and latency.<n>We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference.<n>We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence.<n>Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.