Related papers: PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

URL: http://arxiv.org/abs/2403.04692v2
Date: Sun, 17 Mar 2024 16:59:25 GMT
Title: PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Authors: Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
Abstract summary: We introduce PixArt-Sigma, a Diffusion Transformer model capable of directly generating images at 4K resolution. PixArt-Sigma offers images of markedly higher fidelity and improved alignment with text prompts.
Score: 110.10627872744254
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.

Related papers

Why Compress What You Can Generate? When GPT-4o Generation Ushers in Image Compression Fields [14.805239427360208]
AIGC foundation models are powerful enough to faithfully generate intricate structure and fine-grained details from nothing more than compact descriptors. OpenAI's recent GPT-4o image generation has achieved impressive cross-modality generation, editing, and design capabilities.
arXiv Detail & Related papers (2025-04-30T17:20:14Z)
Pix2Cap-COCO: Advancing Visual Comprehension via Pixel-Level Captioning [36.33160773256632]
We present Pix2Cap-COCO, the first panoptic pixel-level caption dataset designed to advance fine-grained visual understanding. This approach results in 167,254 detailed captions, with an average of 22.94 words per caption. We also introduce a novel task, panoptic segmentation-captioning, which challenges models to recognize instances in an image and provide detailed descriptions for each simultaneously.
arXiv Detail & Related papers (2025-01-23T18:08:57Z)
Semantics Prompting Data-Free Quantization for Low-Bit Vision Transformers [59.772673692679085]
We propose SPDFQ, a Semantics Prompting Data-Free Quantization method for ViTs. First, SPDFQ incorporates Attention Priors Alignment (APA), which uses randomly generated attention priors to enhance the semantics of synthetic images. Second, SPDFQ introduces Multi-Semantic Reinforcement (MSR), which utilizes localized patch optimization to prompt efficient parameterization.
arXiv Detail & Related papers (2024-12-21T09:30:45Z)
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers [41.79064227895747]
Sana is a text-to-image framework that can generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU.
arXiv Detail & Related papers (2024-10-14T15:36:42Z)
SD-πXL: Generating Low-Resolution Quantized Imagery via Score Distillation [64.40561867379627]
Low-resolution quantized imagery, such as pixel art, is seeing a revival in modern applications. We introduce SD-πXL, an approach for producing quantized images that employs score distillation sampling in conjunction with a differentiable image generator. We show that our method can transform input images into low-resolution, quantized versions while retaining their key semantic features.
arXiv Detail & Related papers (2024-10-08T17:48:01Z)
Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization [33.20136645196318]
State-of-the-art text-to-image models are becoming less accessible in practice. Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations. This work demonstrates that more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models.
arXiv Detail & Related papers (2024-08-31T16:09:20Z)
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models [93.29160233752413]
PIXART-δ is a text-to-image synthesis framework. It integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images.
arXiv Detail & Related papers (2024-01-10T16:27:38Z)
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators. It supports high-resolution image synthesis up to 1024px resolution with low training cost. Tests demonstrate that PIXART-α excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
Pixel Adapter: A Graph-Based Post-Processing Approach for Scene Text Image Super-Resolution [22.60056946339325]
We propose the Pixel Adapter Module (PAM) based on graph attention to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. We demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy.
arXiv Detail & Related papers (2023-09-16T08:12:12Z)
Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models [13.894251782142584]
We propose a generative image compression method that demonstrates the potential of saving an image as a short text embedding. Our method outperforms other state-of-the-art deep learning methods in terms of both perceptual quality and diversity.
arXiv Detail & Related papers (2022-11-14T22:54:19Z)
Swin2SR: SwinV2 Transformer for Compressed Image Super-Resolution and Restoration [71.6879432974126]
In this paper, we explore the novel Swin Transformer V2 to improve SwinIR for image super-resolution. We conduct experiments on three representative tasks: JPEG compression artifacts removal, image super-resolution (classical and lightweight), and compressed image super-resolution. Experimental results demonstrate that our method, Swin2SR, can improve the training convergence and performance of SwinIR.
arXiv Detail & Related papers (2022-09-22T23:25:08Z)
Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoireing [71.62289021118983]
We present an efficient baseline model ESDNet for tackling 4K moire images, wherein we build a semantic-aligned scale-aware module to address the scale variation of moire patterns. Our approach outperforms state-of-the-art methods by a large margin while being much more lightweight.
arXiv Detail & Related papers (2022-07-20T14:20:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.