Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
- URL: http://arxiv.org/abs/2510.16807v2
- Date: Thu, 23 Oct 2025 08:29:11 GMT
- Title: Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
- Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
- Abstract summary: SkipV1Former is a Transformer variant that uses skip connections from the first layer's Value heads to strengthen representation and reduce KV cache. We show that SkipV1Former delivers consistent reductions of approximately 25% in KV cache. When combined with YOCO, it cuts KV cache size by nearly 50% while still improving performance.
- Score: 47.05385031325841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have driven breakthroughs across various language tasks through their strong capability to learn rich contextual representations. Scaling them to improve representation, however, often demands substantial memory and compute costs, such as the Key-Value (KV) cache used during auto-regressive decoding. Skip connections offer a promising way to improve representation without bloating resource usage, yet most prior works either improve expressivity while leaving KV costs unchanged, or reduce memory at the cost of weaker representation. In this work, we propose SkipV1Former, a Transformer variant that uses skip connections from the first layer's Value heads to strengthen model representation and reduce KV cache. Specifically, from the second block onward, each layer reuses half of its Value heads from the very first layer, while computing the other half as usual, cutting Value projections and V cache by nearly 50%. Theoretically, we show that routing uncompressed first-layer Values into deeper layers restores information lost to compression and accelerates the model's implicit mesa-optimization, a key pattern of Transformers in auto-regressive tasks. Empirically, across different model scales, SkipV1Former delivers consistent reductions of approximately 25% in KV cache while improving perplexity relative to standard Multi-Head Attention (MHA) Transformers and some advanced variants. Moreover, we propose a recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with only 10-15% additional compute. Finally, SkipV1Former can seamlessly combine with advanced methods like Group-Query Attention and Multi-Latent Attention to achieve further KV cache savings and performance improvement. When combined with YOCO, it cuts KV cache size by nearly 50% while still improving performance.
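The Value-head reuse described in the abstract can be pictured with a minimal, framework-free sketch. This is an illustration only, not the paper's implementation: each "Value head" is just a list of floats, and names such as `compute_values` and `skipv1_values` are hypothetical.

```python
# Toy sketch of SkipV1Former's skip connection from first-layer Value heads.
# Layer 0 computes and caches all of its Value heads; every deeper layer
# reuses half of layer 0's heads and computes only the other half itself.

def compute_values(x, layer, n_heads, d_head):
    """Stand-in for a per-layer Value projection: one vector per head."""
    return [[(layer + 1) * 0.1 * xi for xi in x[:d_head]] for _ in range(n_heads)]

def skipv1_values(x, layer, first_layer_values, n_heads=8, d_head=4):
    """Return the Value heads used by `layer` under the SkipV1 scheme."""
    if layer == 0:
        values = compute_values(x, layer, n_heads, d_head)
        first_layer_values.extend(values)   # cached once, reused by all layers
        return values
    half = n_heads // 2
    reused = first_layer_values[:half]                 # skip connection: no new V cache
    fresh = compute_values(x, layer, half, d_head)     # only half the V projections run
    return reused + fresh

# Only layer 0 stores a full set of Value heads; deeper layers cache half,
# so the V cache shrinks by ~50% (and total KV cache by ~25%, K unchanged).
first = []
x = [1.0, 2.0, 3.0, 4.0]
v0 = skipv1_values(x, 0, first)
v5 = skipv1_values(x, 5, first)
assert len(v0) == 8 and len(v5) == 8
assert v5[:4] == v0[:4]   # first half of layer 5's heads comes from layer 0
```

Because the reused heads are the uncompressed first-layer Values, deeper layers see information that per-layer compression would otherwise discard, which is the intuition behind the paper's theoretical argument.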
Related papers
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers [35.286226181391754]
Cross-layer KV Cache sharing offers a path to mitigate the KV Cache bottleneck, but it typically underperforms within-layer methods like GQA. We propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative ones from the bottom and middle layers. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity.
arXiv Detail & Related papers (2025-12-03T15:22:00Z)
- KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding [72.12756830560217]
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value cache during inference has emerged as a primary efficiency bottleneck. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed.
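The dimension-level idea behind this entry can be illustrated with a toy down-projection. KV-Latent's actual projections are learned and paired with frequency-aware RoPE; the fixed fold-and-sum below is purely a hypothetical stand-in showing why a latent dimension shrinks the cached footprint.

```python
# Toy sketch of dimension-level KV reduction: project each K/V vector into a
# smaller latent space before caching it. The projection here (summing into
# latent_dim buckets) is illustrative only, not the method from the paper.

def down_project(vec, latent_dim):
    """Fold a d-dim vector into latent_dim buckets by summation."""
    out = [0.0] * latent_dim
    for i, v in enumerate(vec):
        out[i % latent_dim] += v
    return out

kv = [float(i) for i in range(64)]   # one 64-dim key/value vector
latent = down_project(kv, 16)        # cached footprint: 16 floats instead of 64
assert len(latent) == 16             # 4x smaller per cached vector
```

Whatever projection is used, the cache cost per token scales with the latent dimension rather than the original head dimension, which is where the memory saving comes from.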
arXiv Detail & Related papers (2025-07-15T12:52:12Z)
- Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression [21.840636839249026]
We introduce ScaleKV, a novel KV cache compression framework tailored for Visual Autoregressive (VAR) architectures. Based on two critical observations, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.
arXiv Detail & Related papers (2025-05-26T07:11:42Z)
- Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models [28.16603647353951]
AQUA-KV is an adaptive quantization scheme for Key-Value caches that relies on compact adapters. We achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores.
arXiv Detail & Related papers (2025-01-31T18:47:42Z)
- PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference. We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence. Our method achieves state-of-the-art performance compared with other approaches.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
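The bookkeeping behind cross-layer sharing of this kind can be sketched as a mapping from each layer to the layer whose cache it reuses. The hand-picked `sharing_map` below is a hypothetical illustration; KVSharer derives its mapping from a dissimilarity-based search over layers, which this sketch does not attempt.

```python
# Toy sketch of cross-layer KV cache sharing: remapped layers store no cache
# of their own and instead read another layer's entries at attention time.

def build_cache(n_layers, sharing_map, kv_for_layer):
    """Store KV only for layers that own their cache; readers follow the map."""
    cache = {}
    for layer in range(n_layers):
        if layer not in sharing_map:       # this layer keeps its own cache
            cache[layer] = kv_for_layer(layer)
    return cache

def read_kv(cache, sharing_map, layer):
    """Resolve a layer to its cache owner (itself if unmapped) and read."""
    return cache[sharing_map.get(layer, layer)]

sharing_map = {3: 1, 5: 2}                 # layers 3 and 5 reuse layers 1 and 2
cache = build_cache(6, sharing_map, lambda l: f"kv@{l}")
assert len(cache) == 4                     # 2 of 6 layer caches eliminated
assert read_kv(cache, sharing_map, 3) == "kv@1"
```

The memory saving is simply the fraction of layers that appear as keys in the map; the hard part, which the paper addresses, is choosing that map without hurting quality.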
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- Value Residual Learning [13.88704205151734]
This paper introduces ResFormer, a novel architecture that enhances information flow by incorporating value residual connections. It achieves equivalent validation loss with 16.11% fewer model parameters and 20.3% less training data compared to the Transformer.
arXiv Detail & Related papers (2024-10-23T14:15:07Z)
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.