PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- URL: http://arxiv.org/abs/2403.04692v2
- Date: Sun, 17 Mar 2024 16:59:25 GMT
- Title: PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
- Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li,
- Abstract summary: We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution.
PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts.
- Score: 110.10627872744254
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we introduce PixArt-\Sigma, a Diffusion Transformer model~(DiT) capable of directly generating images at 4K resolution. PixArt-\Sigma represents a significant advancement over its predecessor, PixArt-\alpha, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-\Sigma is its training efficiency. Leveraging the foundational pre-training of PixArt-\alpha, it evolves from the `weaker' baseline to a `stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-\Sigma are twofold: (1) High-Quality Training Data: PixArt-\Sigma incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-\Sigma achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-\Sigma's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
Related papers
- GlyphDraw2: Automatic Generation of Complex Glyph Posters with Diffusion Models and Large Language Models [7.5791485306093245]
Posters play a crucial role in marketing and advertising, contributing significantly to industrial design.
Despite advances in text rendering accuracy, the field of end-to-end poster generation remains underexplored.
We propose an end-to-end text rendering framework employing a triple cross-attention mechanism rooted in align learning.
arXiv Detail & Related papers (2024-07-02T13:17:49Z) - PIXART-{\delta}: Fast and Controllable Image Generation with Latent
Consistency Models [93.29160233752413]
PIXART-delta is a text-to-image synthesis framework.
It integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-alpha model.
PIXART-delta achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images.
arXiv Detail & Related papers (2024-01-10T16:27:38Z) - PixArt-$\alpha$: Fast Training of Diffusion Transformer for
Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-$alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px resolution with low training cost.
Tests demonstrate that PIXART-$alpha$ excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z) - Emu: Enhancing Image Generation Models Using Photogenic Needles in a
Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z) - Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text
Image Super-Resolution [22.60056946339325]
We propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling.
The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features.
We demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy.
arXiv Detail & Related papers (2023-09-16T08:12:12Z) - Extreme Generative Image Compression by Learning Text Embedding from
Diffusion Models [13.894251782142584]
We propose a generative image compression method that demonstrates the potential of saving an image as a short text embedding.
Our method outperforms other state-of-the-art deep learning methods in terms of both perceptual quality and diversity.
arXiv Detail & Related papers (2022-11-14T22:54:19Z) - Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and
Restoration [71.6879432974126]
In this paper, we explore the novel Swin Transformer V2, to improve SwinIR for image super-resolution.
We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution.
Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR.
arXiv Detail & Related papers (2022-09-22T23:25:08Z) - Towards Efficient and Scale-Robust Ultra-High-Definition Image
Demoireing [71.62289021118983]
We present an efficient baseline model ESDNet for tackling 4K moire images, wherein we build a semantic-aligned scale-aware module to address the scale variation of moire patterns.
Our approach outperforms state-of-the-art methods by a large margin while being much more lightweight.
arXiv Detail & Related papers (2022-07-20T14:20:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.