DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation
- URL: http://arxiv.org/abs/2504.06803v2
- Date: Wed, 16 Apr 2025 04:46:22 GMT
- Title: DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation
- Authors: Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You,
- Abstract summary: Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.<n>We propose textbfDynamic textbfDiffusion textbfTransformer (DyDiT)<n>DyDiT adjusts its computation along both emphtimestep and emphspatial dimensions.
- Score: 66.86241453156225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the \emph{static} inference paradigm, which inevitably introduces redundant computation in certain \emph{diffusion timesteps} and \emph{spatial regions}. To overcome this inefficiency, we propose \textbf{Dy}namic \textbf{Di}ffusion \textbf{T}ransformer (DyDiT), an architecture that \emph{dynamically} adjusts its computation along both \emph{timestep} and \emph{spatial} dimensions. Specifically, we introduce a \emph{Timestep-wise Dynamic Width} (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a \emph{Spatial-wise Dynamic Token} (SDT) strategy to avoid redundant computation at unnecessary spatial locations. TDW and SDT can be seamlessly integrated into DiT and significantly accelerates the generation process. Building on these designs, we further enhance DyDiT in three key aspects. First, DyDiT is integrated seamlessly with flow matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT.
Related papers
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers.<n>Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models.<n>We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z) - ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation [83.62931466231898]
This paper presents ARLON, a framework that boosts diffusion Transformers with autoregressive models for long video generation.<n>A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact visual tokens.<n>An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model.
arXiv Detail & Related papers (2024-10-27T16:28:28Z) - Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z) - MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation [44.74056930805525]
We introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G.
This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures.
Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6$times$ faster than traditional diffusion transformers.
arXiv Detail & Related papers (2024-08-06T17:29:01Z) - Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z) - $Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers [13.433352602762511]
We propose an overall training-free inference acceleration framework $Delta$-DiT.
$Delta$-DiT uses a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages.
Experiments on PIXART-$alpha$ and DiT-XL demonstrate that the $Delta$-DiT can achieve a $1.6times$ speedup on the 20-step generation.
arXiv Detail & Related papers (2024-06-03T09:10:44Z) - DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT)
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z) - The Missing U for Efficient Diffusion Models [3.712196074875643]
Diffusion Probabilistic Models yield record-breaking performance in tasks such as image synthesis, video generation, and molecule design.
Despite their capabilities, their efficiency, especially in the reverse process, remains a challenge due to slow convergence rates and high computational costs.
We introduce an approach that leverages continuous dynamical systems to design a novel denoising network for diffusion models.
arXiv Detail & Related papers (2023-10-31T00:12:14Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.