Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
- URL: http://arxiv.org/abs/2401.11605v1
- Date: Sun, 21 Jan 2024 21:49:49 GMT
- Title: Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
- Authors: Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
- Abstract summary: We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution.
Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers.
- Score: 2.078423403798577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the Hourglass Diffusion Transformer (HDiT), an image generative
model that exhibits linear scaling with pixel count, supporting training at
high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on
the Transformer architecture, which is known to scale to billions of
parameters, it bridges the gap between the efficiency of convolutional U-Nets
and the scalability of Transformers. HDiT trains successfully without typical
high-resolution training techniques such as multiscale architectures, latent
autoencoders or self-conditioning. We demonstrate that HDiT performs
competitively with existing models on ImageNet $256^2$, and sets a new
state-of-the-art for diffusion models on FFHQ-$1024^2$.
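To make the hourglass idea concrete, below is a minimal PyTorch sketch of a two-level pixel-space hourglass transformer: windowed (local) attention keeps the full-resolution levels linear in pixel count, a token-merge step shrinks the sequence before a cheap global-attention bottleneck, and a token-split plus a U-Net-style skip connection restores full resolution. The class names, dimensions, two-level depth, and 1-D windowing are illustrative assumptions for this listing, not the authors' released implementation, and timestep/class conditioning is omitted.

```python
import torch
import torch.nn as nn


class AttnBlock(nn.Module):
    """Pre-norm transformer block. With `window` set, attention runs inside
    non-overlapping groups of `window` tokens, so its cost grows linearly with
    token count (the paper's 2-D neighborhood attention is simplified to 1-D
    windows here)."""
    def __init__(self, dim, heads=4, window=None):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                   # x: (B, N, C)
        b, n, c = x.shape
        h = self.norm1(x)
        if self.window is not None:                         # local attention in fixed-size windows
            h = h.reshape(b * n // self.window, self.window, c)
        x = x + self.attn(h, h, h)[0].reshape(b, n, c)
        return x + self.mlp(self.norm2(x))


class TokenMerge(nn.Module):
    """Downsample: merge each 2x2 neighborhood of tokens into a single token."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(4 * dim_in, dim_out)

    def forward(self, x, hw):
        b, _, c = x.shape
        h, w = hw
        x = x.reshape(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
        return self.proj(x.reshape(b, (h // 2) * (w // 2), 4 * c)), (h // 2, w // 2)


class TokenSplit(nn.Module):
    """Upsample: expand each token back into a 2x2 neighborhood of tokens."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, 4 * dim_out)

    def forward(self, x, hw):
        b, n, _ = x.shape
        h, w = hw
        x = self.proj(x).reshape(b, h, w, 2, 2, -1).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, 4 * n, -1), (2 * h, 2 * w)


class HourglassDiT(nn.Module):
    """Two-level hourglass over pixels: local attention at full resolution,
    global attention only at the downsampled bottleneck, U-Net-style skip."""
    def __init__(self, in_ch=3, dim=64, mid_dim=128, patch=4, window=16):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(in_ch * patch * patch, dim)
        self.down = AttnBlock(dim, window=window)
        self.merge = TokenMerge(dim, mid_dim)
        self.mid = AttnBlock(mid_dim)                       # global attention on few tokens
        self.split = TokenSplit(mid_dim, dim)
        self.up = AttnBlock(dim, window=window)
        self.unembed = nn.Linear(dim, in_ch * patch * patch)

    def forward(self, img):                                 # img: (B, C, H, W) noisy pixels
        b, c, H, W = img.shape
        p = self.patch
        hw = (H // p, W // p)
        x = img.reshape(b, c, hw[0], p, hw[1], p).permute(0, 2, 4, 1, 3, 5)
        x = self.embed(x.reshape(b, hw[0] * hw[1], c * p * p))
        x = self.down(x)
        skip = x
        x, hw_lo = self.merge(x, hw)                        # 4x fewer tokens at the bottleneck
        x = self.mid(x)
        x, _ = self.split(x, hw_lo)
        x = self.up(x + skip)                               # skip connection, as in a U-Net
        x = self.unembed(x).reshape(b, hw[0], hw[1], c, p, p).permute(0, 3, 1, 4, 2, 5)
        return x.reshape(b, c, H, W)                        # prediction with the input's shape


# Example: a 256x256 RGB image in, a 256x256 prediction out.
out = HourglassDiT()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 3, 256, 256])
```

Because the window size is fixed, doubling the number of pixels doubles the number of windows rather than quadrupling the attention cost, which is the source of the linear scaling claimed above.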
Related papers
- FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
We propose FlowDCN, a purely convolution-based generative model that can efficiently generate high-quality images at arbitrary resolutions.
FlowDCN achieves a state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark and comparable resolution extrapolation results.
We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
arXiv Detail & Related papers (2024-10-30T02:48:50Z)
- FiTv2: Scalable and Improved Flexible Vision Transformer for Diffusion Model [76.84519526283083]
We present the Flexible Vision Transformer (FiT), a transformer architecture specifically designed for generating images with unrestricted resolutions and aspect ratios.
FiTv2 exhibits $2\times$ the convergence speed of FiT when incorporating advanced training-free extrapolation techniques.
Comprehensive experiments demonstrate the exceptional performance of FiTv2 across a broad range of resolutions.
arXiv Detail & Related papers (2024-10-17T15:51:49Z)
- TerDiT: Ternary Diffusion Models with Transformers [83.94829676057692]
TerDiT is a quantization-aware training scheme for ternary diffusion models with transformers.
We focus on the ternarization of DiT networks and scale model sizes from 600M to 4.2B.
arXiv Detail & Related papers (2024-05-23T17:57:24Z)
- DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis [56.849285913695184]
Diffusion Mamba (DiM) is a sequence model for efficient high-resolution image synthesis.
DiM architecture achieves inference-time efficiency for high-resolution images.
Experiments demonstrate the effectiveness and efficiency of our DiM.
arXiv Detail & Related papers (2024-05-23T06:53:18Z)
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Scalable Diffusion Models with Transformers [18.903245758902834]
We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches.
We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID.
arXiv Detail & Related papers (2022-12-19T18:59:58Z)
- Improved Transformer for High-Resolution GANs [69.42469272015481]
We introduce two key ingredients to the Transformer architecture to address the challenge of high-resolution synthesis.
We show in the experiments that the proposed HiT achieves state-of-the-art FID scores of 31.87 and 2.95 on unconditional ImageNet $128 \times 128$ and FFHQ $256 \times 256$, respectively.
arXiv Detail & Related papers (2021-06-14T17:39:49Z)
- Multiscale Vision Transformers [79.76412415996892]
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models.
We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks.
arXiv Detail & Related papers (2021-04-22T17:59:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.