SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
- URL: http://arxiv.org/abs/2512.07993v1
- Date: Mon, 08 Dec 2025 19:32:06 GMT
- Title: SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
- Authors: Jiayi Tian, Seyedarmin Azizi, Yequan Zhao, Erfan Baghaei Potraghloo, Sean McPherson, Sharath Nittur Sridhar, Zhengyang Wang, Zheng Zhang, Massoud Pedram, Souvik Kundu
- Abstract summary: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the chain-of-thought (CoT) reasoning process. We present SkipKV, a KV compression method for selective eviction and generation that operates through coarse-grained, sentence-level sequence removal.
- Score: 25.509962883211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, because the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks that limit their efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present SkipKV, a training-free KV compression method for selective eviction and generation that operates through coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a sentence-scoring metric to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, pushing the LRM to generate concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV, which achieves up to 26.7% higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to 1.6× shorter generation length while improving throughput by up to 1.7×.
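The sentence-level removal described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the scoring metric, so this example assumes cosine similarity between per-sentence embedding vectors and a fixed threshold; the function names, the threshold value, and the toy vectors are all hypothetical and are not taken from the paper. It shows only the eviction side (which sentences' KV entries to keep), not the steering-vector mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_sentences_to_keep(sentence_vectors, threshold=0.9):
    """Greedily keep a sentence unless it is highly similar to an already kept one.

    `sentence_vectors` are hypothetical per-sentence embeddings; `threshold`
    is an assumed similarity cutoff, not a value from the paper.
    """
    kept = []
    for i, vec in enumerate(sentence_vectors):
        redundant = any(
            cosine(vec, sentence_vectors[j]) >= threshold for j in kept
        )
        if not redundant:
            kept.append(i)
    return kept


def token_keep_mask(sentence_lengths, kept):
    """Expand sentence-level decisions into a per-token mask over the KV cache."""
    kept_set = set(kept)
    mask = []
    for i, n_tokens in enumerate(sentence_lengths):
        mask.extend([i in kept_set] * n_tokens)
    return mask


# Toy example: sentence 1 nearly repeats sentence 0, sentence 2 is distinct.
vectors = [[1.0, 0.0, 0.0], [0.99, 0.1, 0.0], [0.0, 1.0, 0.0]]
lengths = [3, 2, 4]  # tokens per sentence

kept = select_sentences_to_keep(vectors)
mask = token_keep_mask(lengths, kept)
```

Evicting whole sentences keeps the decision granularity aligned with semantic units, which is the contrast the abstract draws with token-wise eviction; a real system would apply the mask to the cached keys and values of every layer.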
Related papers
- Towards Efficient Large Language Reasoning Models via Extreme-Ratio Chain-of-Thought Compression [55.63153956934198]
Chain-of-Thought (CoT) reasoning successfully enhances the reasoning capabilities of Large Language Models (LLMs). Existing CoT compression methods often suffer from a critical loss of logical fidelity at high compression ratios. We propose a novel EXTreme-RAtio Chain-of-Thought Compression framework, termed Extra-CoT, which aggressively reduces the token budget while preserving answer accuracy.
arXiv Detail & Related papers (2026-02-09T06:57:15Z)
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
- ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8× speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- G-KV: Decoding-Time KV Cache Eviction with Global Attention [57.47409249054187]
Large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance.
arXiv Detail & Related papers (2025-11-29T14:21:33Z)
- Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
- SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning [33.149133156465474]
Long-context inference in large language models is increasingly constrained by the KV cache bottleneck. We propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget.
arXiv Detail & Related papers (2025-08-21T03:48:28Z)
- Retrospective Sparse Attention for Efficient Long-Context Generation [5.562294018150909]
RetroAttention retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Experiments show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods.
arXiv Detail & Related papers (2025-08-12T15:11:47Z)
- LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning [21.761205124793175]
Extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. We propose LazyEviction, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens' recurrence patterns.
arXiv Detail & Related papers (2025-06-19T02:25:04Z)
- SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching [5.545249777535219]
SentenceKV is a novel KV caching approach designed to enhance inference efficiency while preserving semantic coherence. We show that SentenceKV significantly outperforms state-of-the-art methods in both efficiency and memory usage, without compromising model accuracy.
arXiv Detail & Related papers (2025-04-01T17:08:57Z)
- Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs [6.222287867011644]
We propose MorphKV, an inference-time technique that maintains a constant-sized KV cache while preserving accuracy. Unlike retention or lossy compression, MorphKV iteratively refines the KV cache via lightweight updates guided by attention patterns of recent tokens. Our studies show 52.9% memory savings and 18.2% higher accuracy on average compared to state-of-the-art prior works.
arXiv Detail & Related papers (2025-03-02T18:12:50Z)
- Activation-aware Probe-Query: Effective Key-Value Retrieval for Long-Context LLMs Inference [56.71209737306054]
We propose ActQKV, a training-free, Activation-aware approach that dynamically determines probe-Query and leverages it to retrieve the relevant KV pairs for inference. Experiments on the Long-Bench and ∞ Benchmarks demonstrate its state-of-the-art performance with competitive inference quality and resource efficiency.
arXiv Detail & Related papers (2025-02-19T08:50:44Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.