BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
- URL: http://arxiv.org/abs/2503.15927v1
- Date: Thu, 20 Mar 2025 08:07:31 GMT
- Title: BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers
- Authors: Hui Zhang, Tingwei Gao, Jie Shao, Zuxuan Wu
- Abstract summary: Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed. We propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. We also introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration.
- Score: 39.08730113749482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have proven effective across various generation tasks and models, achieving accelerations between 25% and 50% while maintaining generation quality.
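The caching idea described in the abstract can be illustrated with a minimal sketch: during the later denoising stages, the outputs of structure-focused transformer blocks change little between adjacent timesteps, so they can be cached on one step and reused on the next instead of being recomputed. The function below is an illustrative toy, not the authors' implementation; the names (`denoise_with_block_cache`, `reuse_start`, `cached_block_ids`, `reuse_interval`) and the specific reuse schedule are assumptions for demonstration only.

```python
from typing import Callable

def denoise_with_block_cache(
    x: float,
    blocks: list[Callable[[float], float]],
    num_steps: int = 10,
    reuse_start: int = 5,                 # only reuse in later denoising stages
    cached_block_ids: frozenset[int] = frozenset({0, 1}),  # "structure" blocks
    reuse_interval: int = 2,              # recompute every other step
) -> float:
    """Toy denoising loop that skips selected blocks on reuse steps."""
    cache: dict[int, float] = {}
    for step in range(num_steps):
        # On a "reuse" step, cached blocks are skipped entirely.
        reuse = step >= reuse_start and step % reuse_interval == 1
        for i, block in enumerate(blocks):
            if reuse and i in cached_block_ids and i in cache:
                x = cache[i]              # reuse cached feature, skip compute
            else:
                x = block(x)              # full computation
                if i in cached_block_ids:
                    cache[i] = x          # refresh cache on compute steps
        # (noise-scheduler update omitted for brevity)
    return x
```

With, say, two of three blocks cached and reuse on every other late step, roughly a quarter of the block evaluations are skipped in this toy setting, which mirrors the 25–50% accelerations reported in the abstract in spirit only.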
Related papers
- Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models [41.11005178050448]
ProfilingDiT is a novel adaptive caching strategy that explicitly disentangles foreground and background-focused blocks.
Our framework achieves significant acceleration while maintaining visual fidelity across comprehensive quality metrics.
arXiv Detail & Related papers (2025-04-04T03:30:15Z) - Rethinking Video Tokenization: A Conditioned Diffusion-based Approach [58.164354605550194]
Diffusion Conditioned-based Gene Tokenizer replaces the GAN-based decoder with a conditional diffusion model. We trained from scratch using only a basic MSE diffusion loss for reconstruction, along with a KL term and an LPIPS perceptual loss. Even a scaled-down version of CDT (3$\times$ inference speedup) still performs comparably with top baselines.
arXiv Detail & Related papers (2025-03-05T17:59:19Z) - Training-free and Adaptive Sparse Attention for Efficient Long Video Generation [31.615453637053793]
Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency. We propose AdaSpa, the first Dynamic Pattern and Online Precise Search sparse attention method. AdaSpa is implemented as an adaptive, plug-and-play solution and can be integrated seamlessly with existing DiTs.
arXiv Detail & Related papers (2025-02-28T14:11:20Z) - Ditto: Accelerating Diffusion Model via Temporal Value Similarity [4.5280087047319535]
We propose a difference processing algorithm that leverages temporal similarity with quantization to enhance the efficiency of diffusion models. We also design the Ditto hardware, a specialized hardware accelerator, which achieves up to 1.5x speedup and 17.74% energy saving.
arXiv Detail & Related papers (2025-01-20T01:03:50Z) - AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration [45.62669899834342]
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs.
arXiv Detail & Related papers (2024-12-16T12:28:22Z) - Accelerating Vision Diffusion Transformers with Skip Branches [47.07564477125228]
Diffusion Transformers (DiT) are an emerging image and video generation model architecture. DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. We introduce Skip-DiT, which augments standard DiT with skip branches to enhance feature smoothness. We also introduce Skip-Cache, which utilizes the skip branches to cache DiT features across timesteps at inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73$\times$, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - Robust Network Learning via Inverse Scale Variational Sparsification [55.64935887249435]
We introduce an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation.
Unlike frequency-based methods, our approach not only removes noise by smoothing small-scale features.
We show the efficacy of our approach through enhanced robustness against various noise types.
arXiv Detail & Related papers (2024-09-27T03:17:35Z) - $\Delta$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers [13.433352602762511]
We propose an overall training-free inference acceleration framework, $\Delta$-DiT.
$\Delta$-DiT uses a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages.
Experiments on PIXART-$\alpha$ and DiT-XL demonstrate that $\Delta$-DiT can achieve a $1.6\times$ speedup on 20-step generation.
arXiv Detail & Related papers (2024-06-03T09:10:44Z) - Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search [55.41583104734349]
We propose to automatically remove structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (NAS).
Given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which can achieve on-par or even better performance than the teacher.
Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss.
arXiv Detail & Related papers (2023-11-08T12:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.