PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- URL: http://arxiv.org/abs/2403.04692v2
- Date: Sun, 17 Mar 2024 16:59:25 GMT
- Title: PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
- Abstract summary: We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution.
PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts.
- Score: 110.10627872744254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
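The idea of compressing keys and values before attention can be illustrated with a minimal sketch. The abstract does not specify the compression operator, so this example assumes simple average pooling over groups of consecutive tokens in a 1D sequence; the function names (`avg_pool_tokens`, `kv_compressed_attention`) and the `ratio` parameter are hypothetical and are not taken from the paper or its code.

```python
import math


def avg_pool_tokens(tokens, ratio):
    """Average every `ratio` consecutive token vectors into one.

    Assumption: average pooling stands in for the compression operator,
    which the abstract does not describe. The last group may be shorter
    when len(tokens) is not divisible by `ratio`.
    """
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_compressed_attention(queries, keys, values, ratio):
    """Scaled dot-product attention with keys and values compressed first.

    Queries keep full length, so the output has one vector per query, but
    each query only scores against ceil(len(keys) / ratio) compressed keys.
    """
    keys_c = avg_pool_tokens(keys, ratio)
    values_c = avg_pool_tokens(values, ratio)
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim_v = len(values_c[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys_c]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values_c)) for d in range(dim_v)]
        )
    return outputs


# Usage: 8 tokens of dimension 2, compressed 4x, so each query attends
# to 2 pooled keys instead of 8.
tokens = [[float(i), 1.0] for i in range(8)]
result = kv_compressed_attention(tokens, tokens, tokens, ratio=4)
```

With N query tokens and compression ratio R, the number of query-key scores in this sketch drops from N·N to roughly N·N/R, which is the kind of saving that matters as token counts grow at high resolutions.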
Related papers
- SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers [41.79064227895747]
Sana is a text-to-image framework that can generate images up to 4096×4096 resolution.
Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU.
arXiv Detail & Related papers (2024-10-14T15:36:42Z)
- SD-πXL: Generating Low-Resolution Quantized Imagery via Score Distillation [64.40561867379627]
Low-resolution quantized imagery, such as pixel art, is seeing a revival in modern applications.
We introduce SD-πXL, an approach for producing quantized images that employs score distillation sampling in conjunction with a differentiable image generator.
We show that our method can transform input images into low-resolution, quantized versions while retaining their key semantic features.
arXiv Detail & Related papers (2024-10-08T17:48:01Z)
- Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization [33.20136645196318]
State-of-the-art text-to-image models are becoming less accessible in practice.
Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations.
This work demonstrates that more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models.
arXiv Detail & Related papers (2024-08-31T16:09:20Z)
- PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models [93.29160233752413]
PIXART-δ is a text-to-image synthesis framework.
It integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model.
PIXART-δ generates 1024x1024-pixel images in a breakthrough 0.5 seconds.
arXiv Detail & Related papers (2024-01-10T16:27:38Z)
- PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px resolution with low training cost.
Tests demonstrate that PIXART-α excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
- Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution [22.60056946339325]
We propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling.
The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features.
We demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy.
arXiv Detail & Related papers (2023-09-16T08:12:12Z)
- Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models [13.894251782142584]
We propose a generative image compression method that demonstrates the potential of saving an image as a short text embedding.
Our method outperforms other state-of-the-art deep learning methods in terms of both perceptual quality and diversity.
arXiv Detail & Related papers (2022-11-14T22:54:19Z)
- Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration [71.6879432974126]
In this paper, we explore the novel Swin Transformer V2 to improve SwinIR for image super-resolution.
We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution.
Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR.
arXiv Detail & Related papers (2022-09-22T23:25:08Z)
- Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoireing [71.62289021118983]
We present an efficient baseline model ESDNet for tackling 4K moire images, wherein we build a semantic-aligned scale-aware module to address the scale variation of moire patterns.
Our approach outperforms state-of-the-art methods by a large margin while being much more lightweight.
arXiv Detail & Related papers (2022-07-20T14:20:52Z)