Related papers: Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

URL: http://arxiv.org/abs/2601.18999v1
Date: Mon, 26 Jan 2026 22:20:59 GMT
Title: Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
Authors: Fangzhou Wu, Sandeep Silwal, Qiuyi, Zhang,
Abstract summary: KV caching is a technique for accelerating Large Language Model (LLM) inference by reusing key-value ( KV) pairs from previous queries.<n>The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals.<n>We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing.
Score: 31.67506313325633
License: http://creativecommons.org/licenses/by/4.0/
Abstract: KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods to adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92$\times$ in cache hit rate, 11.96$\times$ reduction in latency, 14.06$\times$ reduction in time-to-first-token (TTFT), and 77.4% increase in throughput over the state-of-the-art methods. Our code is available at https://github.com/fzwark/KVRouting.

Related papers

Learning to Evict from Key-Value Cache [17.365511268829703]
We introduce KV Policy, a framework for learning to rank tokens by their predicted usefulness for future decoding.<n> evaluated across two different model families on the long-context benchmark RULER and the multi-turn dialogue benchmark OASST2-4k.<n>Results demonstrate that learning to predict future token utility is a powerful and scalable paradigm for adaptive KV cache management.
arXiv Detail & Related papers (2026-02-10T19:34:15Z)
ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution [84.41751286055909]
We develop a training-based KV cache eviction framework that learns to predict which KV pairs to evict during longtext generations.<n>We formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens.
arXiv Detail & Related papers (2026-02-03T07:16:51Z)
Fast KVzip: Efficient and Accurate LLM Inference with Gated KV Eviction [50.99402504483692]
We propose a novel gating-based KV cache eviction method for frozen-weight language models.<n>Our approach integrates seamlessly into both the prefill and decoding stages.<n>Experiments show that our method maintains near-lossless performance while evicting up to 70% of the KV cache.
arXiv Detail & Related papers (2026-01-25T03:07:54Z)
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models [25.509962883211]
Large reasoning models (LRMs) often cost significant key-value (KV) cache overhead, due to their linear growth with the chain-of-thought (CoT) reasoning process.<n>We present textbfSkipKV, a KV compression method for selective textiteviction and textitgeneration operating at a coarse-grained sentence-level sequence removal.
arXiv Detail & Related papers (2025-12-08T19:32:06Z)
Judge Q: Trainable Queries for Optimized Information Retention in KV Cache Eviction [53.83828564664595]
Large language models (LLMs) utilize key-value ( KV) cache to store historical information during sequence processing.<n>Current methods for KV cache eviction typically utilize the last window from the pre-filling phase as queries to compute the KV importance scores for eviction.<n>We propose Judge Q, a novel training method which incorporates a soft token list.
arXiv Detail & Related papers (2025-09-13T03:34:12Z)
Lookahead Q-Cache: Achieving More Consistent KV Cache Eviction via Pseudo Query [48.52389201779425]
KV cache memory usage grows substantially with longer text sequences.<n>Existing KV cache eviction methods prune tokens using prefilling-stage attention scores.<n>Lookahead Q-Cache generates low-cost pseudo lookahead queries to better approximate the true decoding-stage queries.
arXiv Detail & Related papers (2025-05-24T10:34:38Z)
Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference [37.94892570127548]
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache.<n>Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime.<n>We propose Ada-KV, the first head-wise adaptive budget allocation strategy.
arXiv Detail & Related papers (2024-07-16T09:53:32Z)
Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.<n>Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching. While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error. We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.