Block-wise Adaptive Caching for Accelerating Diffusion Policy
- URL: http://arxiv.org/abs/2506.13456v1
- Date: Mon, 16 Jun 2025 13:14:58 GMT
- Title: Block-wise Adaptive Caching for Accelerating Diffusion Policy
- Authors: Kangye Ji, Yuan Meng, Hanyun Cui, Ye Li, Shengjia Hua, Lei Chen, Zhi Wang
- Abstract summary: Block-wise Adaptive Caching (BAC) is a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves up to 3x inference speedup for free on robotic benchmarks.
- Score: 10.641633189595302
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Policy has demonstrated strong visuomotor modeling capabilities, but its high computational cost renders it impractical for real-time robotic control. Despite huge redundancy across repetitive denoising steps, existing diffusion acceleration techniques fail to generalize to Diffusion Policy due to fundamental architectural and data divergences. In this paper, we propose Block-wise Adaptive Caching (BAC), a method to accelerate Diffusion Policy by caching intermediate action features. BAC achieves lossless action generation acceleration by adaptively updating and reusing cached features at the block level, based on a key observation that feature similarities vary non-uniformly across timesteps and blocks. To operationalize this insight, we first propose the Adaptive Caching Scheduler, designed to identify optimal update timesteps by maximizing the global feature similarities between cached and skipped features. However, applying this scheduler for each block leads to significant error surges due to the inter-block propagation of caching errors, particularly within Feed-Forward Network (FFN) blocks. To mitigate this issue, we develop the Bubbling Union Algorithm, which truncates these errors by updating the upstream blocks with significant caching errors before downstream FFNs. As a training-free plugin, BAC is readily integrable with existing transformer-based Diffusion Policy and vision-language-action models. Extensive experiments on multiple robotic benchmarks demonstrate that BAC achieves up to 3x inference speedup for free.
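The two ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the dynamic program, the cosine similarity measure, the error threshold, and every name below (`schedule_updates`, `bubble_union`, `upstream_errors`, `threshold`) are assumptions made for illustration. The first function picks a fixed number of cache-update timesteps for one block so that the total similarity between each cached feature and the features it stands in for is as large as possible. The second is one possible reading of the Bubbling Union idea: an upstream block with a large caching error is also updated at the timesteps where a downstream FFN block is updated.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def schedule_updates(features, num_updates):
    """Pick `num_updates` timesteps at which one block recomputes its cache.

    `features` holds one feature vector per denoising step, e.g. recorded on a
    calibration run (an assumption of this sketch). Step 0 is always an update.
    """
    T = len(features)
    if not 1 <= num_updates <= T:
        raise ValueError("num_updates must be between 1 and len(features)")
    sim = [[cosine(features[i], features[t]) for t in range(T)] for i in range(T)]
    # seg[i][j]: total similarity if the cache filled at step i serves steps i..j-1.
    seg = [[0.0] * (T + 1) for _ in range(T)]
    for i in range(T):
        for j in range(i + 1, T + 1):
            seg[i][j] = seg[i][j - 1] + sim[i][j - 1]
    neg = float("-inf")
    # best[k][j]: best total similarity covering steps 0..j-1 with k updates.
    best = [[neg] * (T + 1) for _ in range(num_updates + 1)]
    prev = [[-1] * (T + 1) for _ in range(num_updates + 1)]
    best[0][0] = 0.0
    for k in range(1, num_updates + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                if best[k - 1][i] == neg:
                    continue
                cand = best[k - 1][i] + seg[i][j]
                if cand > best[k][j]:
                    best[k][j] = cand
                    prev[k][j] = i
    # Walk back through the stored split points to recover the update steps.
    steps, j = [], T
    for k in range(num_updates, 0, -1):
        j = prev[k][j]
        steps.append(j)
    return sorted(steps)


def bubble_union(schedules, ffn_index, upstream_errors, threshold):
    """Merge an FFN block's update steps into high-error upstream blocks.

    `schedules` maps block index (in execution order) to its update steps;
    `upstream_errors` and `threshold` are hypothetical inputs of this sketch.
    """
    merged = {b: set(s) for b, s in schedules.items()}
    for b, err in upstream_errors.items():
        if b < ffn_index and err > threshold:
            merged[b] |= merged[ffn_index]
    return {b: sorted(s) for b, s in merged.items()}
```

For example, if a block's features stay constant for three steps and then switch to a different constant value, `schedule_updates` with two updates places the second update exactly at the switch. The dynamic program runs in O(K·T²) time for K updates over T steps, which is cheap for the small step counts typical of diffusion sampling.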
Related papers
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - Sortblock: Similarity-Aware Feature Reuse for Diffusion Model [9.749736545966694]
Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities. DiTs' sequential denoising process results in high inference latency. We propose Sortblock, a training-free inference acceleration framework.
arXiv Detail & Related papers (2025-08-01T08:10:54Z) - Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z) - FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation [46.57781555466333]
Diffusion Transformers (DiT) are powerful generative models but remain computationally intensive due to their iterative structure and deep transformer stacks. FastCache is a hidden-state-level caching and compression framework that accelerates DiT inference. Empirical evaluations across multiple DiT variants demonstrate substantial reductions in latency and memory usage.
arXiv Detail & Related papers (2025-05-26T05:58:49Z) - AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse [19.13826316844611]
Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference. We provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method. We propose a novel caching-based acceleration approach for diffusion models, instead of directly reusing cached results.
arXiv Detail & Related papers (2025-04-13T08:29:58Z) - Exposure Bias Reduction for Enhancing Diffusion Transformer Feature Caching [7.393824353099595]
Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. We analyze the impact of caching on the SNR of the diffusion process. We introduce EB-Cache, a joint cache strategy that aligns the Non-exposure bias.
arXiv Detail & Related papers (2025-03-10T09:49:18Z) - QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation. DiTs come with significant drawbacks, including increased computational and memory costs. We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z) - ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference [41.41316718220569]
ExpertFlow is designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU.
Our experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods.
arXiv Detail & Related papers (2024-10-23T15:24:54Z) - HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We develop a novel learning-based caching framework dubbed HarmoniCa. It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process. Our framework achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$.
arXiv Detail & Related papers (2024-10-02T16:34:29Z) - Token Caching for Diffusion Transformer Acceleration [30.437462937127773]
TokenCache is a novel post-training acceleration method for diffusion transformers.
It reduces redundant computations among tokens across inference steps.
We show that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.
arXiv Detail & Related papers (2024-09-27T08:05:34Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, which we name approximate-key caching.
While approximate cache hits alleviate the DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system's performance for classic LRU and ideal caches, perform a trace-driven evaluation of the expected performance, and compare the benefits of our proposed approach with state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z) - Harnessing Wireless Channels for Scalable and Privacy-Preserving Federated Learning [56.94644428312295]
Wireless connectivity is instrumental in enabling federated learning (FL).
Channel randomness perturbs each worker's model update, while multiple workers' updates incur significant interference on bandwidth.
In A-FADMM, all workers upload their model updates to the parameter server using a single channel via analog transmissions.
This not only saves communication bandwidth, but also hides each worker's exact model update trajectory from any eavesdropper.
arXiv Detail & Related papers (2020-07-03T16:31:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.