Related papers: Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference

URL: http://arxiv.org/abs/2512.11221v1
Date: Fri, 12 Dec 2025 02:02:02 GMT
Title: Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery: Sublinear Memory Growth for Efficient LLM Inference
Authors: Adilet Metinov, Gulida M. Kudakeeva, Bolotbek uulu Nursultan, Gulnara D. Kabaeva,
Abstract summary: We present a training-free inference-time framework for efficient large language model generation.<n>Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value updates for low-importance tokens.<n>We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present Adaptive Soft Rolling KV Freeze with Entropy-Guided Recovery (ASR-KF-EGR), a training-free inference-time framework for efficient large language model generation. Our method introduces a reversible soft-freeze mechanism that temporarily suspends key-value (KV) updates for low-importance tokens identified within a sliding attention window. Unlike eviction-based approaches that permanently discard context, ASR-KF-EGR preserves all tokens in off-GPU storage and restores them on demand. We extend the framework with sublinear freeze scheduling, where freeze duration grows sublinearly with repeated low-importance detections, preventing over-aggressive compression. Preliminary experiments on LLaMA-3 8B demonstrate 55-67% reduction in active KV cache size while maintaining generation quality and passing needle-in-haystack retrieval tests. The method is architecture-agnostic, requires no fine-tuning, and provides a practical solution for memory-constrained deployment of long-context LLMs.

Related papers

DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations.<n>Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage.<n>Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.<n>Our approach integrates seamlessly into both the prefill and decoding stages.<n>Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs [26.951325519894525]
We propose a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate.<n>We show that it consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes.<n>It even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization.
arXiv Detail & Related papers (2025-12-03T00:20:35Z)
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving [11.750209684686707]
Generative reasoning with large language models (LLMs) often involves long decoding sequences.<n>We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding.<n>Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increases throughput by up to 2.56x.
arXiv Detail & Related papers (2025-11-08T14:52:43Z)
KV-Efficient VLA: A Method of Speed up Vision Language Model with RNN-Gated Chunked KV Cache [0.9238700679836854]
Vision-Language-Action (VLA) models promise unified robotic perception and control, yet their scalability is constrained by the quadratic cost of attention and the unbounded growth of key-value (KV) memory during long-horizon inference.<n>We present KV-Efficient VLA, a model-agnostic memory compression framework that addresses these limitations by introducing a lightweight, training-friendly mechanism to selectively retain high-utility context.<n>Our method integrates seamlessly into existing autoregressive and hybrid VLA stacks, enabling scalable inference without modifying training pipelines or downstream control logic.
arXiv Detail & Related papers (2025-09-20T02:04:24Z)
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities.<n>We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop.<n>We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
Runtime Adaptive Pruning for LLM Inference [7.5252252615137225]
We propose RAP, an elastic pruning framework driven by reinforcement learning (RL)<n>RAP tracks the evolving ratio between model parameters and KV-cache across practical execution.<n>RAP outperforms state-of-the-art baselines, marking the first time to jointly consider model weights and KVcache on the fly.
arXiv Detail & Related papers (2025-05-22T06:12:42Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation. To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed. We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.