Related papers: $Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

URL: http://arxiv.org/abs/2406.01125v1
Date: Mon, 3 Jun 2024 09:10:44 GMT
Title: $Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers
Authors: Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, Tao Chen
Abstract summary: We propose an overall training-free inference acceleration framework, $\Delta$-DiT. $\Delta$-DiT uses a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Experiments on PIXART-$\alpha$ and DiT-XL demonstrate that $\Delta$-DiT can achieve a $1.6\times$ speedup on the 20-step generation.
Score: 13.433352602762511
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models are widely recognized for generating high-quality and diverse images, but their poor real-time performance has led to numerous acceleration works, primarily focusing on UNet-based structures. With the more successful results achieved by diffusion transformers (DiT), there is still a lack of exploration regarding the impact of DiT structure on generation, as well as the absence of an acceleration framework tailored to the DiT architecture. To tackle these challenges, we conduct an investigation into the correlation between DiT blocks and image generation. Our findings reveal that the front blocks of DiT are associated with the outline of the generated images, while the rear blocks are linked to the details. Based on this insight, we propose an overall training-free inference acceleration framework $\Delta$-DiT: using a designed cache mechanism to accelerate the rear DiT blocks in the early sampling stages and the front DiT blocks in the later stages. Specifically, a DiT-specific cache mechanism called $\Delta$-Cache is proposed, which considers the inputs of the previous sampling image and reduces the bias in the inference. Extensive experiments on PIXART-$\alpha$ and DiT-XL demonstrate that the $\Delta$-DiT can achieve a $1.6\times$ speedup on the 20-step generation and even improves performance in most cases. In the scenario of 4-step consistent model generation and the more challenging $1.12\times$ acceleration, our method significantly outperforms existing methods. Our code will be publicly available.
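
The stage-dependent caching described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is one reading of the abstract, in which a span of blocks is skipped on some sampling steps and replaced by a cached delta (the span's output minus its input from an earlier step) added to the current input. The class name `DeltaCacheSketch`, the helper `make_toy_block`, and the parameters `num_cached`, `boundary_step`, and `interval` are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector, int], Vector]  # (features, timestep) -> features


def run_blocks(blocks: Sequence[Block], x: Vector, t: int) -> Vector:
    """Apply a sequence of blocks in order."""
    for block in blocks:
        x = block(x, t)
    return x


class DeltaCacheSketch:
    """Hypothetical illustration of stage-dependent delta caching."""

    def __init__(self, blocks: Sequence[Block], num_cached: int,
                 boundary_step: int, interval: int = 2) -> None:
        if not 0 < num_cached <= len(blocks):
            raise ValueError("num_cached must be in 1..len(blocks)")
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.blocks = list(blocks)
        self.num_cached = num_cached        # assumed size of the cached span
        self.boundary_step = boundary_step  # assumed early/late stage boundary
        self.interval = interval            # assumed refresh period in steps
        self.delta: Optional[Vector] = None
        self.cached_span: Optional[Tuple[int, int]] = None
        self.skipped_blocks = 0

    def _span(self, step: int) -> Tuple[int, int]:
        """Rear blocks in the early stage, front blocks in the later stage."""
        n = len(self.blocks)
        if step < self.boundary_step:
            return n - self.num_cached, n
        return 0, self.num_cached

    def forward(self, x: Vector, t: int, step: int) -> Vector:
        start, end = self._span(step)
        head = run_blocks(self.blocks[:start], x, t)
        # Recompute when the span changed or a refresh step is reached.
        refresh = self.cached_span != (start, end) or step % self.interval == 0
        if refresh:
            mid = run_blocks(self.blocks[start:end], head, t)
            self.delta = [b - a for a, b in zip(head, mid)]
            self.cached_span = (start, end)
        else:
            # Skip the span and reuse the cached delta on the current input.
            mid = [a + d for a, d in zip(head, self.delta)]
            self.skipped_blocks += end - start
        return run_blocks(self.blocks[end:], mid, t)


def make_toy_block(shift: float) -> Block:
    """Toy stand-in for a DiT block: adds a constant to every feature."""
    return lambda x, t: [v + shift for v in x]
```

With toy blocks that only add constants, the delta does not depend on the input, so cached steps reproduce the full forward pass exactly. Real transformer blocks would only be approximated, and the trade-off between speed and quality would depend on the assumed `interval`, `num_cached`, and `boundary_step`.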

Related papers

DyDiT++: Dynamic Diffusion Transformers for Efficient Visual Generation [66.86241453156225]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs. We propose Dynamic Diffusion Transformer (DyDiT). DyDiT adjusts its computation along both timestep and spatial dimensions.
arXiv Detail & Related papers (2025-04-09T11:48:37Z)
Accelerating Vision Diffusion Transformers with Skip Branches [46.19946204953147]
Diffusion Transformers (DiT) are an emerging image and video generation model architecture. DiT's practical deployment is constrained by computational complexity and redundancy in the sequential denoising process. We introduce Skip-DiT, which converts standard DiT into Skip-DiT with skip branches to enhance feature smoothness. We also introduce Skip-Cache which utilizes the skip branches to cache DiT features across timesteps at the inference time.
arXiv Detail & Related papers (2024-11-26T17:28:10Z)
Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs. We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by $1.73\times$, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration [31.982294870690925]
We propose a novel learning-based caching framework dubbed HarmoniCa. It incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process. It also incorporates an Image Error Proxy-Guided Objective (IEPO) to balance image quality against cache utilization.
arXiv Detail & Related papers (2024-10-02T16:34:29Z)
Effective Diffusion Transformer Architecture for Image Super-Resolution [63.254644431016345]
We design an effective diffusion transformer for image super-resolution (DiT-SR). In practice, DiT-SR leverages an overall U-shaped architecture and adopts a uniform isotropic design for all the transformer blocks. We analyze the limitation of the widely used AdaLN and present a frequency-adaptive time-step conditioning module.
arXiv Detail & Related papers (2024-09-29T07:14:16Z)
DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the Diffusion Transformers (DiT) design. In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ memory at a resolution of $1792 \times 1792$. With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with FlashAttention-2.
arXiv Detail & Related papers (2024-05-28T17:59:33Z)
TerDiT: Ternary Diffusion Models with Transformers [88.03738506648291]
TerDiT is the first quantization-aware training scheme for low-bit diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging from 600M to 4.2B, and image resolution from $256 \times 256$ to $512 \times 512$.
arXiv Detail & Related papers (2024-05-23T17:57:24Z)
T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching [143.72720563387082]
Trajectory Stitching (T-Stitch) is a simple yet efficient technique to improve sampling efficiency with little or no generation degradation. Our key insight is that different diffusion models learn similar encodings under the same training data distribution. Our method can also be used as a drop-in technique to accelerate the popular pretrained stable diffusion (SD) models.
arXiv Detail & Related papers (2024-02-21T23:08:54Z)
Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features. We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps. We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z)
Exploiting Activation based Gradient Output Sparsity to Accelerate Backpropagation in CNNs [15.465530153038927]
Machine/deep-learning (ML/DL) based techniques are emerging as a driving force behind many cutting-edge technologies. However, training these models involving large parameters is both time-consuming and energy-hogging.
arXiv Detail & Related papers (2021-09-16T04:12:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.