KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs
- URL: http://arxiv.org/abs/2512.11851v1
- Date: Thu, 04 Dec 2025 17:04:43 GMT
- Title: KV Cache Recycling to Expand Usable Context Capacity in Low Parameter LLMs
- Authors: Prashant Pandey
- Abstract summary: We build a cache of past activations, retrieve entries by sentence embeddings, and reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled and baseline runs on latency and output fidelity, and log reuse depth in tokens. In tests, we observe consistent speedups when prefix overlap exists, with no material degradation in output semantics; when overlap is absent, behavior matches baseline.
- Score: 2.261486598306908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate whether attention key-value (KV) states computed for one prompt by a small LLM can be reused to accelerate inference on a new, similar prompt, effectively expanding its usable context memory through an approach called token recycling. Using a standard Hugging Face setup with DialoGPT-medium (a 345M-parameter GPT-2-style decoder trained on 147M Reddit exchanges, 2005 to 2017) as the testbed, we build a cache of past activations, retrieve entries by sentence embeddings, and reuse cached past key values when the cached prompt is an exact prefix of the new input. We compare recycled and baseline runs on latency and output fidelity, and log reuse depth in tokens. Reproducibility requires no model modifications: cached KVs are serialized to the CPU, reloaded, and supplied to the generate function to continue decoding from the cached prefix. In tests, we observe consistent speedups when prefix overlap exists, with no material degradation in output semantics; when overlap is absent, behavior matches baseline.
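The exact-prefix reuse rule described in the abstract can be illustrated with a minimal sketch. This is not the author's code: the names `CacheEntry`, `is_exact_prefix`, `find_reusable`, and `plan_decode` are hypothetical, the KV states are opaque placeholders, and the sentence-embedding retrieval step is replaced here by a plain linear scan over the cache. It only shows how a reuse depth (in tokens) and the remaining uncached tokens might be derived.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence, Tuple


@dataclass
class CacheEntry:
    """Hypothetical cache record: a tokenized prompt plus its stored KV states."""
    token_ids: Tuple[int, ...]
    past_key_values: Any  # placeholder; a real setup would hold tensors reloaded from CPU


def is_exact_prefix(cached: Sequence[int], new: Sequence[int]) -> bool:
    """True when every cached token matches the start of the new input."""
    return len(cached) <= len(new) and tuple(new[:len(cached)]) == tuple(cached)


def find_reusable(cache: List[CacheEntry], new_ids: Sequence[int]) -> Optional[CacheEntry]:
    """Return the longest cached entry that is an exact prefix of new_ids, if any.

    Assumption: a linear scan stands in for the embedding-based retrieval.
    """
    matches = [e for e in cache if e.token_ids and is_exact_prefix(e.token_ids, new_ids)]
    return max(matches, key=lambda e: len(e.token_ids), default=None)


def plan_decode(cache: List[CacheEntry], new_ids: Sequence[int]):
    """Return (reuse_depth, remaining_ids, past_key_values).

    With no overlap, reuse_depth is 0 and the whole input is processed,
    which corresponds to the baseline path.
    """
    entry = find_reusable(cache, new_ids)
    if entry is None:
        return 0, list(new_ids), None
    depth = len(entry.token_ids)
    return depth, list(new_ids[depth:]), entry.past_key_values


if __name__ == "__main__":
    cache = [CacheEntry((1, 2, 3), "kv_a"), CacheEntry((1, 2), "kv_b")]
    print(plan_decode(cache, [1, 2, 3, 4, 5]))
    print(plan_decode(cache, [7, 8]))
```

In a real pipeline the returned KV states and the remaining tokens would be handed to the model's generate call; that step is omitted here because it depends on the model and library version.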
Related papers
- Trellis: Learning to Compress Key-Value Memory in Attention Models [48.12167339402521]
This paper introduces Trellis, a novel Transformer architecture with bounded memory. Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory. Experiments on language modeling, common-sense reasoning, recall-intensive tasks, and time series show that the proposed architecture outperforms strong baselines.
arXiv Detail & Related papers (2025-12-29T20:32:10Z)
- SemShareKV: Efficient KVCache Sharing for Semantically Similar Prompts via Token-Level LSH Matching [0.8307668828380427]
We propose SemShareKV, a KV cache sharing and compression framework for large language models (LLMs). Instead of relying on exact token matches, SemShareKV applies fuzzy token matching using locality-sensitive hashing (LSH) on token embeddings and incorporates Rotary Position Embedding (RoPE) to better preserve positional information. Experiments on diverse summarization datasets show up to 6.25× speedup and 42% lower GPU memory usage with 5k tokens input, with negligible quality degradation.
arXiv Detail & Related papers (2025-09-29T14:16:13Z)
- Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value (KV) cache to store historical information during sequence processing. Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction. We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
- Retrospective Sparse Attention for Efficient Long-Context Generation [5.562294018150909]
RetroAttention retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Experiments show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods.
arXiv Detail & Related papers (2025-08-12T15:11:47Z)
- EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse [22.769631685777494]
Cross-request key-value (KV) cache reuse is a technique that stores and reuses intermediate computations. In infilling tasks, KV cache reuse is often hindered by the structure of the prompt format. We propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse.
arXiv Detail & Related papers (2025-05-28T02:07:03Z)
- RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion [15.344568214955688]
Large language models (LLMs) often incorporate multiple text chunks in their inputs to provide the necessary contexts. To speed up the prefill of the long LLM inputs, one can pre-compute the KV cache of a text and re-use the KV cache when the context is reused as the prefix of another LLM input. This paper tackles just one challenge: how to quickly combine their precomputed KV caches in order to achieve the same generation quality as the expensive full prefill.
arXiv Detail & Related papers (2024-05-26T06:00:17Z)
- Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference [78.65321721142624]
We focus on a memory bottleneck imposed by the key-value (KV) cache.
Existing KV cache methods approach this problem by pruning or evicting large swaths of relatively less important KV pairs.
We propose LESS, a simple integration of a constant sized cache with eviction-based cache methods.
arXiv Detail & Related papers (2024-02-14T18:54:56Z)