Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
- URL: http://arxiv.org/abs/2512.03870v1
- Date: Wed, 03 Dec 2025 15:22:00 GMT
- Title: Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
- Authors: Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng
- Abstract summary: Cross-layer KV Cache sharing offers a path to mitigate the KV Cache bottleneck, but it typically underperforms within-layer methods like GQA. We propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity.
- Score: 35.286226181391754
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although Cross-layer KV Cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV Cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values in the top layers. Our preliminary analysis reveals a clear pattern: values are predominantly derived from the bottom layer, while keys draw more information from both the bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach where top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce KV cache memory by 50% while achieving lower validation perplexity than the standard Transformer decoder, establishing this design as a memory-efficient, high-performance architectural alternative.
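The listing does not include code, so the following is a minimal PyTorch sketch of the cross-layer fusion idea as the abstract describes it. The class name, the per-head sigmoid gate, and the small linear adapters are illustrative assumptions, not the authors' implementation; the grounded elements are that top-layer keys are fused from bottom- and middle-layer post-RoPE keys, top-layer values come from the bottom layer, and FusedKV-Lite skips the learned fusion entirely.

```python
# Minimal sketch (PyTorch) of the cross-layer KV fusion described in the
# abstract. The class name, the per-head sigmoid gate, and the small linear
# adapters are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class CrossLayerKVFusion(nn.Module):
    """Reconstructs a top-layer KV cache from caches of lower layers.

    Keys: learnable fusion of post-RoPE keys from a bottom and a middle layer,
    so relative positional information is kept without re-applying RoPE.
    Values: taken from the bottom layer, which the paper's analysis identifies
    as the dominant source of value information.
    """

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.key_gate = nn.Parameter(torch.zeros(num_heads, 1, 1))  # per-head mix (assumption)
        self.k_adapt = nn.Linear(head_dim, head_dim, bias=False)    # light adapters (assumption)
        self.v_adapt = nn.Linear(head_dim, head_dim, bias=False)

    def forward(self, k_bottom, k_middle, v_bottom):
        # Inputs: [batch, num_heads, seq_len, head_dim]; keys are post-RoPE.
        gate = torch.sigmoid(self.key_gate)
        k_top = self.k_adapt(gate * k_bottom + (1.0 - gate) * k_middle)
        v_top = self.v_adapt(v_bottom)
        return k_top, v_top


def fused_kv_lite(k_middle, v_bottom):
    # "FusedKV-Lite" as described: no learned fusion; top layers directly reuse
    # middle-layer keys and bottom-layer values, trading a small perplexity
    # increase for lower I/O.
    return k_middle, v_bottom
```

Under this scheme only the bottom- and middle-layer caches need to be stored; the top-layer caches are reconstructed on the fly, which is where the reported 50% cache-memory reduction would come from.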
Related papers
- CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing [54.34080239841088]
CommonKV is a training-free method for cross-layer KV cache compression through adjacent-layer parameter sharing. We show that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios.
arXiv Detail & Related papers (2025-08-22T06:55:45Z)
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed (see the low-rank projection sketch after this list).
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
- ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [69.57122277845293]
We propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity-aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation. For Values, we propose Offline Value Calibration (OVC), which efficiently calibrates the value projection matrix using calibration data without training.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference [41.149350870029046]
Sharing the key-value (KV) cache across layers has been found effective for efficient inference of large language models (LLMs). We propose a unified framework that covers several recent methods and their novel variants. We find that when reducing the size of the KV cache by 2x, most configurations can achieve higher throughput than standard transformers.
arXiv Detail & Related papers (2024-10-18T13:01:14Z)
- MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection [14.073722038551125]
The KV cache has become a de facto technique for the inference of large language models. This paper uses low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We find that our method can sustain over 90% of performance with an average KV cache compression rate of 60%.
arXiv Detail & Related papers (2024-10-16T08:34:51Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios [13.144156413032896]
We introduce CSKV, a training-efficient Channel Shrinking technique for KV cache compression.
We show that CSKV can reduce the memory overhead of the KV cache by 80% while maintaining the model's long-context capability.
Our method can be seamlessly combined with quantization to further reduce the memory overhead, achieving a compression ratio of up to 95%.
arXiv Detail & Related papers (2024-09-16T17:36:50Z)
- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling [38.732413451399]
PyramidKV is a novel and effective KV cache compression method. We show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension.
arXiv Detail & Related papers (2024-06-04T07:51:30Z)
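Several of the related papers above (KV-Latent, ReCalKV, MatryoshkaKV, LoRC, CSKV) compress the cache along the channel dimension rather than across layers. The sketch below illustrates that shared low-rank-projection idea in generic form; the truncated-SVD calibration, function names, and shapes are illustrative assumptions and do not reproduce any specific paper's procedure.

```python
# Minimal sketch of dimension-level (low-rank) KV cache compression, the idea
# shared by KV-Latent, ReCalKV, MatryoshkaKV, LoRC and CSKV. The truncated SVD
# calibration and the shapes used here are illustrative assumptions, not any
# single paper's calibration or training procedure.
import torch


def build_projections(k_samples: torch.Tensor, rank: int):
    """Fit a down/up projection pair from sample keys via truncated SVD.

    k_samples: [num_tokens, head_dim] calibration keys for one head.
    Returns (down, up) with down: [head_dim, rank], up: [rank, head_dim].
    """
    # Top-`rank` right singular vectors span the subspace that preserves
    # most of the calibration keys' energy.
    _, _, vh = torch.linalg.svd(k_samples, full_matrices=False)
    basis = vh[:rank]                    # [rank, head_dim]
    return basis.T.contiguous(), basis   # down-projection, up-projection


def compress(kv: torch.Tensor, down: torch.Tensor) -> torch.Tensor:
    # Store only the rank-dimensional latent instead of the full head_dim.
    return kv @ down                     # [..., head_dim] -> [..., rank]


def decompress(latent: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # Reconstruct approximate keys/values on the fly at attention time.
    return latent @ up                   # [..., rank] -> [..., head_dim]


if __name__ == "__main__":
    head_dim, rank = 128, 32             # 4x smaller cache along the channel axis
    calib = torch.randn(4096, head_dim)  # hypothetical calibration keys
    down, up = build_projections(calib, rank)
    keys = torch.randn(2, 1024, head_dim)
    cached = compress(keys, down)        # what would actually be kept in the KV cache
    approx = decompress(cached, up)
    print(cached.shape, approx.shape)
```

The common trade-off in these methods is between the retained rank (cache size) and reconstruction error, which the individual papers control with calibration data, trainable projections, or progressive compression schedules.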