PixArt-$\alpha$: Fast Training of Diffusion Transformer for
Photorealistic Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2310.00426v3
- Date: Fri, 29 Dec 2023 16:42:08 GMT
- Title: PixArt-$\alpha$: Fast Training of Diffusion Transformer for
Photorealistic Text-to-Image Synthesis
- Authors: Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu,
Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li
- Abstract summary: This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px resolution with low training cost.
Tests demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
- Score: 108.83343447275206
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The most advanced text-to-image (T2I) models require significant training
costs (e.g., millions of GPU hours), seriously hindering fundamental
innovation in the AIGC community while increasing CO2 emissions. This paper
introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image
generation quality is competitive with state-of-the-art image generators (e.g.,
Imagen, SDXL, and even Midjourney), reaching near-commercial application
standards. Additionally, it supports high-resolution image synthesis up to
1024px resolution with low training cost, as shown in Figures 1 and 2. To
achieve this goal, three core designs are proposed: (1) Training strategy
decomposition: We devise three distinct training steps that separately optimize
pixel dependency, text-image alignment, and image aesthetic quality; (2)
Efficient T2I Transformer: We incorporate cross-attention modules into
Diffusion Transformer (DiT) to inject text conditions and streamline the
computation-intensive class-condition branch; (3) High-informative data: We
emphasize the significance of concept density in text-image pairs and leverage
a large Vision-Language model to auto-label dense pseudo-captions to assist
text-image alignment learning. As a result, PIXART-$\alpha$'s training speed
markedly surpasses existing large-scale T2I models, e.g., PIXART-$\alpha$ only
takes 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU
days), saving nearly \$300,000 (\$26,000 vs. \$320,000) and cutting CO2
emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training
cost is merely 1%. Extensive experiments demonstrate that PIXART-$\alpha$
excels in image quality, artistry, and semantic control. We hope
PIXART-$\alpha$ will provide new insights to the AIGC community and startups to
accelerate building their own high-quality yet low-cost generative models from
scratch.
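To make design (2) above concrete, the following is a minimal PyTorch sketch of a DiT-style block in which a cross-attention layer injects frozen text-encoder features into the image tokens, while timestep conditioning is folded into lightweight scale/shift/gate modulation rather than a heavy class-condition branch. The widths (1152-dim tokens, 4096-dim T5-style text features), token counts, and module layout are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttnDiTBlock(nn.Module):
    """Illustrative DiT block with text cross-attention (a sketch of design 2)."""

    def __init__(self, dim: int = 1152, n_heads: int = 16, text_dim: int = 4096):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_proj = nn.Linear(text_dim, dim)  # project text features to model width
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Timestep conditioning as scale/shift/gate pairs for the self-attention
        # and MLP sub-layers (an assumed adaLN-style modulation scheme).
        self.modulation = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_tokens, t_emb):
        # x: (B, N, dim) latent image tokens
        # text_tokens: (B, L, text_dim) frozen text-encoder outputs
        # t_emb: (B, dim) timestep embedding
        s1, b1, g1, s2, b2, g2 = [m.unsqueeze(1) for m in self.modulation(t_emb).chunk(6, dim=-1)]
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.self_attn(h, h, h, need_weights=False)[0]
        ctx = self.text_proj(text_tokens)           # text tokens act as keys/values
        x = x + self.cross_attn(self.norm2(x), ctx, ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)


# Hypothetical shapes: a batch of 2 images as 256 latent tokens, 120 text tokens.
block = CrossAttnDiTBlock()
out = block(torch.randn(2, 256, 1152), torch.randn(2, 120, 4096), torch.randn(2, 1152))
print(out.shape)  # torch.Size([2, 256, 1152])
```

Because the text pathway enters only through this cross-attention (and the timestep through a small modulation layer), the block avoids duplicating a full conditioning branch per token, which is the kind of streamlining the abstract credits for the reduced training cost.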
Related papers
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers [41.79064227895747]
Sana is a text-to-image framework that can generate images up to $4096\times4096$ resolution.
Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU.
arXiv Detail & Related papers (2024-10-14T15:36:42Z) - Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget [53.311109531586844]
We demonstrate very low-cost training of large-scale T2I diffusion transformer models.
We train a 1.16 billion parameter sparse transformer for an economical cost of only \$1,890 and achieve an FID of 12.7 in zero-shot generation.
We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
arXiv Detail & Related papers (2024-07-22T17:23:28Z) - PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation [110.10627872744254]
We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution.
PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts.
arXiv Detail & Related papers (2024-03-07T17:41:37Z) - PIXART-$\delta$: Fast and Controllable Image Generation with Latent
Consistency Models [93.29160233752413]
PIXART-$\delta$ is a text-to-image synthesis framework.
It integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-$\alpha$ model.
PIXART-$\delta$ generates 1024x1024 pixel images in a breakthrough 0.5 seconds; a generic few-step consistency-sampling sketch appears after this list.
arXiv Detail & Related papers (2024-01-10T16:27:38Z) - CommonCanvas: An Open Diffusion Model Trained with Creative-Commons
Images [19.62509002853736]
We assemble a dataset of Creative-Commons-licensed (CC) images to train text-to-image generative models.
We use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with curated CC images.
We develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION-2B data needed to train existing SD2 models, but obtains comparable quality.
arXiv Detail & Related papers (2023-10-25T17:56:07Z) - Emu: Enhancing Image Generation Models Using Photogenic Needles in a
Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z) - SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two
Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z) - Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image
Diffusion Models [6.821399706256863]
W"urstchen is a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness.
A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation.
arXiv Detail & Related papers (2023-06-01T13:00:53Z) - Towards Faster and Stabilized GAN Training for High-fidelity Few-shot
Image Synthesis [21.40315235087551]
We propose a lightweight GAN structure that achieves superior quality at 1024x1024 resolution.
We show our model's superior performance compared to the state-of-the-art StyleGAN2 when data and computing budgets are limited.
arXiv Detail & Related papers (2021-01-12T22:02:54Z)
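To illustrate why the LCM integration in PIXART-$\delta$ above yields sub-second generation, here is a minimal, generic sketch of few-step consistency sampling. The noise schedule, step count, and consistency_fn signature are assumptions for illustration, not the paper's implementation.

```python
import torch

def alpha_sigma(t, T=1000):
    # Placeholder signal/noise schedule; a real sampler would read these
    # coefficients from the trained model's noise scheduler.
    a = 1.0 - t / T
    return a ** 0.5, (1.0 - a) ** 0.5

@torch.no_grad()
def lcm_sample(consistency_fn, steps=(999, 749, 499, 249), shape=(1, 4, 128, 128)):
    # Start from pure Gaussian noise in latent space.
    x = torch.randn(shape)
    x0 = None
    for i, t in enumerate(steps):
        # One network call maps the noisy latent straight to a clean estimate.
        x0 = consistency_fn(x, t)
        if i + 1 < len(steps):
            # Re-noise the clean estimate down to the next, smaller timestep.
            alpha, sigma = alpha_sigma(steps[i + 1])
            x = alpha * x0 + sigma * torch.randn_like(x0)
    return x0
```

Four network evaluations replace the tens of denoising iterations of a standard diffusion sampler, which is where the latency saving comes from.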