STAR: Scale-wise Text-conditioned AutoRegressive image generation
- URL: http://arxiv.org/abs/2406.10797v4
- Date: Wed, 19 Feb 2025 06:00:55 GMT
- Title: STAR: Scale-wise Text-conditioned AutoRegressive image generation
- Authors: Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, Yi Jin
- Abstract summary: We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm.
STAR enables text-driven image generation up to 1024$\times$1024 through three key designs.
- Score: 38.98271279816512
- License:
- Abstract: We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is constrained to class-conditioned synthesis for images up to 256$\times$256, STAR enables text-driven image generation up to 1024$\times$1024 through three key designs. First, we introduce a pre-trained text encoder to extract and adopt representations for textual constraints, enhancing details and generalizability. Second, given the inherent structural correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version, ensuring consistent interpretation of relative positions across token maps and stabilizing the training process. Third, we observe that simultaneously sampling all tokens within a single scale can disrupt inter-token relationships, leading to structural instability, particularly in high-resolution generation. To address this, we propose a novel stable sampling method that incorporates causal relationships into the sampling process, ensuring both rich details and stable structures. Compared to previous diffusion models and auto-regressive models, STAR surpasses existing benchmarks in fidelity, text-image consistency, and aesthetic quality, requiring just 2.21s for 1024$\times$1024 images on A100. This highlights the potential of auto-regressive methods in high-quality image synthesis, offering new directions for the text-to-image generation.
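The abstract does not give the formula for the normalized 2D RoPE, so the following is only a minimal sketch of one plausible reading: token coordinates at every scale are rescaled to a shared range before the rotary angles are computed, so that token maps of different sizes agree on what a given relative offset means. The names `normalized_grid`, `rope_2d`, the reference extent `ref`, and the frequency base are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def normalized_grid(h, w, ref=16.0):
    """Token-center coordinates of an h x w map, rescaled to span [0, ref].

    Assumption: "normalized" means every scale shares the same coordinate range.
    """
    ys = (np.arange(h) + 0.5) / h * ref
    xs = (np.arange(w) + 0.5) / w * ref
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)  # shape (h*w, 2)

def rope_2d(x, coords, base=10000.0):
    """Apply a rotary embedding to x of shape (n, d), with d divisible by 4.

    The first half of the channels is rotated by the row coordinate,
    the second half by the column coordinate.
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(0, half, 2) / half)  # shape (half/2,)
    out = np.empty_like(x)
    for axis in range(2):
        lo, hi = axis * half, (axis + 1) * half
        angles = coords[:, axis:axis + 1] * freqs  # shape (n, half/2)
        cos, sin = np.cos(angles), np.sin(angles)
        a, b = x[:, lo:hi:2], x[:, lo + 1:hi:2]
        out[:, lo:hi:2] = a * cos - b * sin
        out[:, lo + 1:hi:2] = a * sin + b * cos
    return out
```

Under this assumption, a 2x2 token map and a 4x4 token map are both centered on the same point of the shared coordinate range, and attention scores between rotated queries and keys depend only on coordinate differences, which is the property rotary encodings are generally used for.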
Related papers
- Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis [7.099258248662009]
Text-to-image (T2I) models have significantly advanced the development of artificial intelligence.
However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image.
We leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process.
arXiv Detail & Related papers (2024-09-27T19:31:04Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)