Scalable Diffusion Models with Transformers
- URL: http://arxiv.org/abs/2212.09748v1
- Date: Mon, 19 Dec 2022 18:59:58 GMT
- Title: Scalable Diffusion Models with Transformers
- Authors: William Peebles, Saining Xie
- Abstract summary: We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID.
- Score: 18.903245758902834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore a new class of diffusion models based on the transformer
architecture. We train latent diffusion models of images, replacing the
commonly-used U-Net backbone with a transformer that operates on latent
patches. We analyze the scalability of our Diffusion Transformers (DiTs)
through the lens of forward pass complexity as measured by Gflops. We find that
DiTs with higher Gflops -- through increased transformer depth/width or
increased number of input tokens -- consistently have lower FID. In addition to
possessing good scalability properties, our largest DiT-XL/2 models outperform
all prior diffusion models on the class-conditional ImageNet 512x512 and
256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
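As a concrete illustration of the backbone swap the abstract describes, below is a minimal PyTorch sketch of the patchify step that turns a VAE latent into transformer tokens. The dimensions match the DiT-XL/2 configuration reported in the paper (hidden size 1152, patch size 2, 32x32x4 latents for 256x256 images), but the class and variable names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

# Minimal sketch of DiT-style patchification: a VAE latent feature map is
# split into non-overlapping p x p patches, and each patch is linearly
# embedded as one token for the transformer. Illustrative names only.

class Patchify(nn.Module):
    def __init__(self, in_channels=4, patch_size=2, hidden_size=1152):
        super().__init__()
        # A strided convolution embeds each patch in one shot.
        self.proj = nn.Conv2d(in_channels, hidden_size,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 4, 32, 32) latent -> (B, 1152, 16, 16) for patch_size=2
        x = self.proj(x)
        # Flatten the spatial grid into a token sequence: (B, 256, 1152).
        return x.flatten(2).transpose(1, 2)

latents = torch.randn(8, 4, 32, 32)   # batch of VAE latents
tokens = Patchify()(latents)          # (8, 256, 1152)
```

Halving the patch size quadruples the token count, and with it the forward-pass Gflops, which is one of the scaling axes the abstract refers to.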
Related papers
- FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model [76.84519526283083]
We present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
FiTv2 exhibits $2\times$ the convergence speed of FiT when incorporating advanced training-free extrapolation techniques.
Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
arXiv Detail & Related papers (2024-10-17T15:51:49Z)
- Dynamic Diffusion Transformer [67.13876021157887]
Diffusion Transformer (DiT) has demonstrated superior performance but suffers from substantial computational costs.
We propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation.
With 3% additional fine-tuning, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by $1.73\times$, and achieves a competitive FID score of 2.07 on ImageNet.
arXiv Detail & Related papers (2024-10-04T14:14:28Z)
- DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention [82.24166963631949]
We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the Diffusion Transformers (DiT) design.
In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ memory at a resolution of $1792\times 1792$.
With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with FlashAttention-2 (a minimal gated linear attention sketch appears after this list).
arXiv Detail & Related papers (2024-05-28T17:59:33Z)
- TerDiT: Ternary Diffusion Models with Transformers [83.94829676057692]
TerDiT is a quantization-aware training scheme for ternary diffusion models with transformers.
We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B parameters (a minimal ternarization sketch appears after this list).
arXiv Detail & Related papers (2024-05-23T17:57:24Z)
- Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers [2.078423403798577]
We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution.
Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers.
arXiv Detail & Related papers (2024-01-21T21:49:49Z)
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers [33.15117998855855]
SiT is a family of generative models built on the backbone of Diffusion Transformers (DiT).
The interpolant framework allows connecting two distributions in a more flexible way than standard diffusion models (a minimal interpolant sketch appears after this list).
SiT surpasses DiT uniformly across model sizes on the conditional ImageNet 256x256 and 512x512 benchmarks.
arXiv Detail & Related papers (2024-01-16T18:55:25Z)
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to the Transformer to address the quadratic cost of self-attention at high resolutions.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128\times 128$ and FFHQ $256\times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)
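For the DiG entry above, here is a minimal sketch of gated linear attention in its recurrent form, assuming PyTorch; it illustrates the general technique (a gated outer-product state update with cost linear in sequence length) rather than DiG's actual implementation, and all names are hypothetical.

```python
import torch

# Generic gated linear attention, recurrent form: the KV state is an
# outer-product accumulator decayed by a learned gate, so cost is linear
# in sequence length. Illustrative only; not DiG's implementation.

def gated_linear_attention(q, k, v, g):
    # q, k, v, g: (B, T, D); g in (0, 1) is a per-dimension forget gate.
    B, T, D = q.shape
    state = torch.zeros(B, D, D)                 # running KV summary
    out = []
    for t in range(T):
        # Decay the state, then add the new key-value outer product.
        state = g[:, t].unsqueeze(-1) * state \
              + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
        # Query the state: (B, D) against (B, D, D) -> (B, D).
        out.append(torch.einsum('bd,bde->be', q[:, t], state))
    return torch.stack(out, dim=1)               # (B, T, D)

B, T, D = 2, 16, 64
q, k, v = (torch.randn(B, T, D) for _ in range(3))
g = torch.sigmoid(torch.randn(B, T, D))          # gates in (0, 1)
y = gated_linear_attention(q, k, v, g)           # (2, 16, 64)
```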
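For the TerDiT entry, the following sketch shows one common ternarization scheme (absmean scaling with weights rounded to {-1, 0, +1}, as popularized by ternary-LLM work); TerDiT's exact quantizer may differ, so treat this as a generic illustration.

```python
import torch

# Generic ternary weight quantization: scale by the mean absolute value,
# then round and clip to {-1, 0, +1}. TerDiT's scheme may differ.

def ternarize(w, eps=1e-6):
    scale = w.abs().mean().clamp(min=eps)   # per-tensor scale factor
    q = (w / scale).round().clamp_(-1, 1)   # ternary codes {-1, 0, +1}
    return q, scale

w = torch.randn(1152, 1152)                 # a DiT-XL-sized linear weight
q, s = ternarize(w)
w_hat = q * s                               # dequantized approximation
# In quantization-aware training, the forward pass uses w_hat while a
# straight-through estimator routes gradients to the full-precision w.
```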
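For the SiT entry, the idea of connecting two distributions can be illustrated with the simplest linear interpolant; this is one instance of the framework, not SiT's code, and the velocity-regression loss in the final comment is an assumption based on standard flow-matching practice.

```python
import torch

# A linear interpolant connecting data x0 and noise x1:
#     x_t = (1 - t) * x0 + t * x1,  t in [0, 1]
# The framework admits other schedules; a network is typically trained to
# regress the path velocity d x_t / d t = x1 - x0. Illustrative only.

def interpolate(x0, x1, t):
    t = t.view(-1, 1, 1, 1)          # broadcast time over image dims
    xt = (1 - t) * x0 + t * x1       # sample on the interpolant path
    velocity = x1 - x0               # regression target for the network
    return xt, velocity

x0 = torch.randn(8, 4, 32, 32)       # data (e.g. VAE latents)
x1 = torch.randn(8, 4, 32, 32)       # Gaussian noise
t = torch.rand(8)                    # one random time per sample
xt, v = interpolate(x0, x1, t)
# Training would minimize, e.g., ((model(xt, t) - v) ** 2).mean().
```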
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.