Related papers: KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

URL: http://arxiv.org/abs/2602.05929v2
Date: Sat, 07 Feb 2026 15:57:16 GMT
Title: KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs
Authors: Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu
Abstract summary: Large language models rely on kv-caches to avoid redundant computation during autoregressive decoding. As context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of kv-caches.
Score: 28.06342293292956
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models rely on kv-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of kv-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of kv-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of kv-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
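
Illustrative sketch (not from the paper): the abstract names two quantities, the optimal low-rank approximation under the Frobenius norm and the Normalized Effective Rank, without giving formulas. The Python below shows one plausible reading, working directly from a list of singular values. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and assumes normalization by the number of singular values; the paper's exact definitions may differ. The function names are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this is the standard entropy-based definition, which may
    not match the paper's exact formula.
    """
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Zero singular values contribute nothing to the entropy.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def normalized_effective_rank(singular_values):
    """Effective rank divided by the number of singular values (assumed)."""
    return effective_rank(singular_values) / len(singular_values)


def relative_truncation_error(singular_values, k):
    """Relative Frobenius error of the best rank-k approximation.

    By the Eckart-Young theorem, truncating the SVD to the k largest
    singular values is optimal, and the squared error equals the sum of
    the squared discarded singular values.
    """
    s = sorted(singular_values, reverse=True)
    total = sum(x * x for x in s)
    tail = sum(x * x for x in s[k:])
    return math.sqrt(tail / total)


# A flat spectrum is hard to compress; a peaked one is easy.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [1.0, 0.0, 0.0, 0.0]
print(normalized_effective_rank(flat))    # close to 1.0
print(normalized_effective_rank(peaked))  # 0.25
print(relative_truncation_error([3.0, 4.0], 1))  # 0.6
```

Under these assumptions, a value near 1 means the singular values are spread evenly, so a low-rank approximation discards a lot of energy, while a value near 1/n means a single direction dominates.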

Related papers

Joint Encoding of KV-Cache Blocks for Scalable LLM Serving [3.3230675313521716]
Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware.
arXiv Detail & Related papers (2026-01-06T14:50:58Z)
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent parameter sharing. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
KVShare: An LLM Service System with Efficient and Effective Multi-Tenant KV Cache Reuse [17.301289617498448]
We present a KV cache management module that shares the KV cache across requests under multi-tenant scenarios. KVShare reduces TTFT by up to 9.39x and increases throughput by 1.2x compared to full KV recompute. KVShare achieves a 20.38% boost in accuracy compared to SOTA methods.
arXiv Detail & Related papers (2025-03-17T16:43:35Z)
DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance [125.81664663201282]
We introduce a new KV cache compression method dubbed DBudgetKV. It features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance. Our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average.
arXiv Detail & Related papers (2025-02-24T06:33:39Z)
Can LLMs Maintain Fundamental Abilities under KV Cache Compression? [29.510433427184385]
We present a benchmark KVFundaBench to evaluate the effects of KV cache compression across diverse fundamental language models. We propose ShotKV, a novel compression approach that handles prefill and decoding phases while maintaining shot-level semantic coherence.
arXiv Detail & Related papers (2025-02-04T02:23:06Z)
ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts. ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units. Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z)
KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption. We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to reducing its memory consumption include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.