PIXART-{\delta}: Fast and Controllable Image Generation with Latent
Consistency Models
- URL: http://arxiv.org/abs/2401.05252v1
- Date: Wed, 10 Jan 2024 16:27:38 GMT
- Title: PIXART-{\delta}: Fast and Controllable Image Generation with Latent
Consistency Models
- Authors: Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang
Zhao, Zhenguo Li
- Abstract summary: PIXART-delta is a text-to-image synthesis framework.
It integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-alpha model.
PIXART-delta generates 1024x1024 pixel images in just 0.5 seconds.
- Score: 93.29160233752413
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This technical report introduces PIXART-{\delta}, a text-to-image synthesis
framework that integrates the Latent Consistency Model (LCM) and ControlNet
into the advanced PIXART-{\alpha} model. PIXART-{\alpha} is recognized for its
ability to generate high-quality images of 1024px resolution through a
remarkably efficient training process. The integration of LCM in
PIXART-{\delta} significantly accelerates the inference speed, enabling the
production of high-quality images in just 2-4 sampling steps. Notably, PIXART-{\delta}
generates 1024x1024 pixel images in as little as 0.5 seconds,
a 7x improvement over PIXART-{\alpha}. Additionally,
PIXART-{\delta} is designed to be efficiently trainable on 32GB V100 GPUs
within a single day. With its 8-bit inference capability (von Platen et al.,
2023), PIXART-{\delta} can synthesize 1024px images within 8GB GPU memory
constraints, greatly enhancing its usability and accessibility. Furthermore,
incorporating a ControlNet-like module enables fine-grained control over
text-to-image diffusion models. We introduce a novel ControlNet-Transformer
architecture, specifically tailored for Transformers, achieving explicit
controllability alongside high-quality image generation. As a state-of-the-art,
open-source image generation model, PIXART-{\delta} offers a promising
alternative to the Stable Diffusion family of models, contributing
significantly to text-to-image synthesis.
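The few-step inference described above maps directly onto the diffusers library (the von Platen et al., 2023 reference). Below is a minimal sketch assuming the publicly released PixArt-alpha/PixArt-LCM-XL-2-1024-MS checkpoint and a recent diffusers version; the prompt is illustrative, and the commented offload call is a generic diffusers memory option rather than the report's exact 8-bit pipeline.

```python
# Minimal sketch: 2-4-step LCM sampling with a PixArt LCM checkpoint.
# Assumes a recent diffusers release and the public
# PixArt-alpha/PixArt-LCM-XL-2-1024-MS weights; not the authors' exact setup.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-LCM-XL-2-1024-MS",  # LCM-distilled PixArt weights
    torch_dtype=torch.float16,
).to("cuda")

# Optional: trade speed for memory on small GPUs. The report's 8GB figure
# additionally relies on 8-bit inference, which this sketch does not reproduce.
# pipe.enable_model_cpu_offload()

image = pipe(
    prompt="A small cactus with a happy face in the Sahara desert",
    num_inference_steps=4,  # LCM needs only 2-4 steps
    guidance_scale=0.0,     # LCM-distilled models sample without classifier-free guidance
).images[0]
image.save("pixart_lcm_1024.png")
```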
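The ControlNet-Transformer design can be pictured with a short sketch: trainable copies of the first N frozen base Transformer blocks process the control features, and each copy feeds back into the frozen stream through a zero-initialized projection, so training starts from the unmodified base model. All class and variable names below are hypothetical, and the exact wiring of the control branch is one plausible reading of the report, not the authors' implementation.

```python
# Illustrative sketch of a ControlNet-like module for a Transformer backbone.
# Hypothetical names throughout; an illustration of the idea, not the paper's code.
import copy
import torch
import torch.nn as nn

class ControlNetTransformer(nn.Module):
    def __init__(self, base_blocks: nn.ModuleList, hidden_dim: int, n_control: int = 13):
        super().__init__()
        self.base_blocks = base_blocks  # frozen pretrained blocks
        for p in self.base_blocks.parameters():
            p.requires_grad_(False)
        # Trainable copies of the first n_control base blocks.
        self.control_blocks = nn.ModuleList(
            copy.deepcopy(base_blocks[i]) for i in range(n_control)
        )
        # Zero-initialized projections: at step 0 the model equals the base model.
        self.zero_linears = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(n_control)
        )
        for lin in self.zero_linears:
            nn.init.zeros_(lin.weight)
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        c = control
        for i, block in enumerate(self.base_blocks):
            if i < len(self.control_blocks):
                # First copy sees control + hidden state; later copies refine
                # the control features (one plausible wiring).
                c = self.control_blocks[i](c + x if i == 0 else c)
                x = block(x) + self.zero_linears[i](c)
            else:
                x = block(x)
        return x
```

Because the projections start at zero, the first training step reproduces the frozen base model exactly, which is the standard ControlNet initialization trick.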
Related papers
- FlowDCN: Exploring DCN-like Architectures for Fast Image Generation with Arbitrary Resolution [33.07779971446476]
We propose FlowDCN, a purely convolution-based generative model that can efficiently generate high-quality images at arbitrary resolutions.
FlowDCN achieves a state-of-the-art 4.30 sFID on the $256\times256$ ImageNet benchmark and comparable resolution-extrapolation results.
We believe FlowDCN offers a promising solution to scalable and flexible image synthesis.
arXiv Detail & Related papers (2024-10-30T02:48:50Z)
- OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation [95.29102596532854]
A tokenizer serves as a translator, mapping intricate visual data into a compact latent space.
This paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization.
arXiv Detail & Related papers (2024-06-13T17:59:26Z)
- An Image is Worth 32 Tokens for Reconstruction and Generation [54.24414696392026]
The Transformer-based 1-Dimensional Tokenizer (TiTok) is an innovative approach that tokenizes images into 1D latent sequences.
TiTok achieves competitive performance to state-of-the-art approaches.
Our best-performing variant significantly surpasses DiT-XL/2 (gFID 2.13 vs. 3.04) while generating high-quality samples 74x faster.
arXiv Detail & Related papers (2024-06-11T17:59:56Z)
- PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation [110.10627872744254]
We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution.
PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts.
arXiv Detail & Related papers (2024-03-07T17:41:37Z)
- Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers [2.078423403798577]
We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high resolution.
Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers.
arXiv Detail & Related papers (2024-01-21T21:49:49Z)
- PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports image synthesis at up to 1024px resolution with low training cost.
Tests demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we break these limitations for the first time by building on recent developments in continuous implicit representations.
Experiments show that the proposed method achieves real-time performance on 2048$\times$2048 images using a single GTX 1080 Ti GPU.
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
- ImageSig: A signature transform for ultra-lightweight image recognition [0.0]
ImageSig is based on computing signatures and does not require a convolutional structure or an attention-based encoder.
ImageSig shows unprecedented performance on hardware such as the Raspberry Pi and Jetson Nano.
arXiv Detail & Related papers (2022-05-13T23:48:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.