Related papers: Attention Is All You Need for KV Cache in Diffusion LLMs

Attention Is All You Need for KV Cache in Diffusion LLMs

URL: http://arxiv.org/abs/2510.14973v1
Date: Thu, 16 Oct 2025 17:59:48 GMT
Title: Attention Is All You Need for KV Cache in Diffusion LLMs
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen,
Abstract summary: Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models.<n>Our method achieves significantly higher throughput ($6.8times$ on GSM8K) than existing confidence-based approaches.
Score: 36.94369617373333
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides ${when}$ to refresh (via an attention-aware drift test on the most-attended token) and ${where}$ to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.

Related papers

SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models [56.45983529954998]
We present SPA-Cache that jointly optimize update identification and budget allocation in DLM cache.<n>First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace.<n>Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality.
arXiv Detail & Related papers (2026-01-30T05:22:44Z)
d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching [7.004421957218099]
Diffusion-based large language models (dLLMs) suffer from inferior inference efficiency.<n>We introduce d$2$Cache, which is a training-free approximate KV cache framework for accelerating dLLM inference.
arXiv Detail & Related papers (2025-09-27T04:07:23Z)
dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.<n>We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.<n>Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z)
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving [10.835583587146274]
This paper presents PSA, a $underlineP$rogressive $underlineS$parse $underlineA$ttention mechanism.<n>It integrates algorithmic innovations with system co-design to achieve both high inference accuracy and improved efficiency in large language models.<n>Experiments demonstrate that PSA reduces KV cache usage for attention computation by up to 2.4$times$ and 8.8$times$, and increases end-to-end serving throughput by up to 1.4$times$ and 2.0$times$.
arXiv Detail & Related papers (2025-03-01T07:56:42Z)
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings.<n>In these scenarios, the Key-Value ( KV) cache is the primary bottleneck in terms of both GPU memory and latency.<n>We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [97.41972925670508]
Large vision-language models (LVLMs) incur significant computational and memory overhead during inference.<n>We present PrefixKV, where "Prefix" means the top-ranked KV based on importance rather than position in the original sequence.<n>Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z)
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression [10.003118268356017]
Long context poses significant challenges for inference efficiency.<n>We introduce ClusterKV, which recalls tokens at the granularity of semantic clusters.<n>Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths.
arXiv Detail & Related papers (2024-12-04T10:58:27Z)
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration [7.463830743649754]
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. Key-Value (KV) cache encodes long visual contexts, such as images or videos. Existing KV cache compression methods are effective for Large Language Models (LLMs) We propose a novel KV cache compression recipe tailored for accelerating VLM inference.
arXiv Detail & Related papers (2024-10-29T20:04:34Z)
ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models. We propose an importance-driven cache merging strategy to prune redundancy caches. For instruction encoding, we utilize the frequency to evaluate the importance of caches. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.<n>Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.