d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
- URL: http://arxiv.org/abs/2509.23094v1
- Date: Sat, 27 Sep 2025 04:07:23 GMT
- Title: d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
- Authors: Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang,
- Abstract summary: Diffusion-based large language models (dLLMs) suffer from inferior inference efficiency. We introduce d$^2$Cache, a training-free approximate KV cache framework for accelerating dLLM inference.
- Score: 7.004421957218099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which can enable quasi left-to-right generation and mitigate premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
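The abstract describes the mechanism only at a high level: at each decoding step a small subset of tokens is selected and has its KV states recomputed, while the cached KV states of every other token are reused. The toy PyTorch sketch below illustrates that general idea under stated assumptions; the single attention layer, the confidence-based `select_active_tokens` heuristic, and all function names are placeholders for illustration, not the paper's actual two-stage selection strategy or released code.

```python
# Hedged sketch (not the authors' implementation): approximate KV caching for a
# bidirectional, diffusion-style decoder. Only an adaptively selected subset of
# positions has its K/V recomputed each step; all other positions reuse cached K/V.
import torch

def attention(q, k, v):
    # Standard scaled dot-product attention over the full (bidirectional) sequence.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def select_active_tokens(confidence, budget):
    # Assumed selection rule: refresh the K/V of the least confident positions,
    # since their states are most likely to change between decoding steps.
    return torch.topk(-confidence, k=budget).indices

def decode_step(hidden, wq, wk, wv, kv_cache, confidence, budget):
    q = hidden @ wq                                    # queries are always fresh
    active = select_active_tokens(confidence, budget)  # positions to update this step
    kv_cache["k"][active] = hidden[active] @ wk        # recompute K only for active tokens
    kv_cache["v"][active] = hidden[active] @ wv        # recompute V only for active tokens
    return attention(q, kv_cache["k"], kv_cache["v"]), active

if __name__ == "__main__":
    torch.manual_seed(0)
    seq_len, dim, budget = 16, 32, 4
    hidden = torch.randn(seq_len, dim)
    wq, wk, wv = (torch.randn(dim, dim) * dim ** -0.5 for _ in range(3))
    # Initialize the cache once with a full pass, then update it sparsely.
    kv_cache = {"k": hidden @ wk, "v": hidden @ wv}
    confidence = torch.rand(seq_len)                   # stand-in for per-token confidence
    out, active = decode_step(hidden, wq, wk, wv, kv_cache, confidence, budget)
    print("updated positions:", active.tolist(), "output shape:", tuple(out.shape))
```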
Related papers
- SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models [56.45983529954998]
We present SPA-Cache, which jointly optimizes update identification and budget allocation in the DLM cache.
First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace.
Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality.
arXiv Detail & Related papers (2026-01-30T05:22:44Z)
- VLCache: Computing 2% Vision Tokens and Reusing 98% for Vision-Language Inference [32.33685370786451]
VLCache is a cache reuse framework that exploits both the key-value (KV) cache and encoder and language inputs to eliminate costly recomputation when the same multimodal content recurs across requests.
We show that VLCache achieves accuracy on par with full recomputation, while requiring only 2-5% of the tokens to compute, yielding 1.2x-16x TTFT speedups.
arXiv Detail & Related papers (2025-12-15T04:45:47Z)
- Attention Is All You Need for KV Cache in Diffusion LLMs [36.94369617373333]
Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models.
Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches.
arXiv Detail & Related papers (2025-10-16T17:59:48Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference.
We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models [52.56008278458534]
LaCache is a training-free method for efficient and accurate generative inference of Large Language Models.
LaCache enables LLMs to address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory.
arXiv Detail & Related papers (2025-07-14T19:09:57Z)
- FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [43.83288560196838]
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks.
FastCache is a hidden-state-level caching and compression framework that accelerates DiT inference.
Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage.
arXiv Detail & Related papers (2025-05-26T05:58:49Z)
- dKV-Cache: The Cache for Diffusion Language Models [53.85291644298835]
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models.
We propose a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs.
Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process.
arXiv Detail & Related papers (2025-05-21T17:32:10Z)
- VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration [7.463830743649754]
Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks.
Key-Value (KV) cache encodes long visual contexts, such as images or videos.
Existing KV cache compression methods are effective for Large Language Models (LLMs).
We propose a novel KV cache compression recipe tailored for accelerating VLM inference.
arXiv Detail & Related papers (2024-10-29T20:04:34Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- DeepCache: Accelerating Diffusion Models for Free [65.02607075556742]
DeepCache is a training-free paradigm that accelerates diffusion models from the perspective of model architecture.
DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models.
Under the same throughput, DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
arXiv Detail & Related papers (2023-12-01T17:01:06Z)
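Several entries above (DeepCache and FastCache in particular) rest on the same observation: consecutive denoising steps produce highly similar intermediate features, so the expensive part of the network only needs to be recomputed occasionally. The minimal sketch below illustrates that step-to-step feature reuse under stated assumptions; the block names, refresh schedule, and toy update rule are placeholders for illustration, not any cited paper's actual architecture.

```python
# Hedged sketch: reuse expensive "deep" features across adjacent denoising steps,
# refreshing them only every `refresh_every` steps while recomputing the cheap
# shallow part at every step.
import torch

def shallow_block(x):   # stand-in for the cheap early layers
    return torch.tanh(x)

def deep_block(x):      # stand-in for the expensive deep layers
    return torch.relu(x) * 0.5

def denoise(x, steps=10, refresh_every=3):
    cached_deep = None
    for t in range(steps):
        h = shallow_block(x)                 # always recomputed
        if cached_deep is None or t % refresh_every == 0:
            cached_deep = deep_block(h)      # periodic full computation
        x = x - 0.1 * (h + cached_deep)      # toy update using (possibly stale) deep features
    return x

if __name__ == "__main__":
    torch.manual_seed(0)
    print(denoise(torch.randn(4)).tolist())
```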
This list is automatically generated from the titles and abstracts of the papers on this site.