Progressive Text-to-Image Generation
- URL: http://arxiv.org/abs/2210.02291v5
- Date: Wed, 20 Sep 2023 06:55:27 GMT
- Title: Progressive Text-to-Image Generation
- Authors: Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang
- Abstract summary: We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
- Score: 40.09326229583334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown
remarkable results in text-to-image synthesis by predicting discrete image
tokens uniformly, from the top left to the bottom right of the latent space.
Although this simple generative process works surprisingly well, is it the best
way to generate an image? For instance, humans tend to create an image from
coarse outline to fine detail, whereas VQ-AR models do not consider any
relative importance of image patches. In this paper, we present a progressive
model for high-fidelity text-to-image generation. The proposed method takes
effect by creating new image tokens from coarse to fine based on the existing
context in a parallel manner, and this procedure is recursively applied with
the proposed error revision mechanism until an image sequence is completed. The
resulting coarse-to-fine hierarchy makes the image generation process intuitive
and interpretable. Extensive experiments on the MS COCO benchmark demonstrate that
the progressive model produces significantly better results compared with the
previous VQ-AR method in FID score across a wide variety of categories and
aspects. Moreover, the parallel generation design at each step allows more
than $13\times$ inference acceleration with only a slight performance loss.
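The loop the abstract describes, filling a batch of token positions in parallel and then recursively revising the existing context, can be sketched in toy Python. Everything below (the `toy_predict` stand-in, the uniform fill schedule, the dummy token values) is a hypothetical illustration of the control flow, not the paper's actual transformer or importance-based schedule:

```python
import math

MASK = -1  # placeholder id for positions whose token has not been generated yet

def toy_predict(tokens, positions):
    """Stand-in for the VQ token predictor.

    The real model predicts codebook indices conditioned on the text prompt
    and the existing image-token context; here we return deterministic dummy
    tokens so the control flow is runnable.
    """
    return {p: (p * 7) % 100 for p in positions}

def progressive_decode(seq_len=16, steps=4):
    """Coarse-to-fine parallel decoding with a trivial revision pass."""
    tokens = [MASK] * seq_len
    per_step = math.ceil(seq_len / steps)
    pending = list(range(seq_len))  # a real schedule would order by importance
    while pending:
        batch, pending = pending[:per_step], pending[per_step:]
        # Parallel generation: every position in the batch is filled at once.
        for pos, tok in toy_predict(tokens, batch).items():
            tokens[pos] = tok
        # Error revision (sketch): re-predict already-filled positions and
        # overwrite any token the model no longer agrees with.
        filled = [i for i, t in enumerate(tokens) if t != MASK]
        for pos, tok in toy_predict(tokens, filled).items():
            tokens[pos] = tok
    return tokens
```

Because each step fills `seq_len / steps` positions at once, the number of model calls grows with the number of steps rather than the sequence length, which is the source of the reported inference acceleration.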
Related papers
- Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z) - ImageFolder: Autoregressive Image Generation with Folded Tokens [51.815319504939396]
Increasing token length is a common approach to improve the image reconstruction quality.
There exists a trade-off between reconstruction and generation quality regarding token length.
We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling.
arXiv Detail & Related papers (2024-10-02T17:06:39Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
arXiv Detail & Related papers (2024-10-02T16:05:27Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to handling the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Draft-and-Revise: Effective Image Generation with Contextual RQ-Transformer [40.04085054791994]
We propose Draft-and-Revise, an effective image generation framework with a Contextual RQ-Transformer that considers global contexts during the generation process.
In experiments, our method achieves state-of-the-art results on conditional image generation.
arXiv Detail & Related papers (2022-06-09T12:25:24Z) - Autoregressive Image Generation using Residual Quantization [40.04085054791994]
We propose a two-stage framework to generate high-resolution images.
The framework consists of Residual-Quantized VAE (RQ-VAE) and RQ-Transformer.
Our approach has a significantly faster sampling speed than previous AR models to generate high-quality images.
arXiv Detail & Related papers (2022-03-03T11:44:46Z) - Vector Quantized Diffusion Model for Text-to-Image Synthesis [47.09451151258849]
We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation.
Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results.
arXiv Detail & Related papers (2021-11-29T18:59:46Z) - Semi-Autoregressive Image Captioning [153.9658053662605]
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner.
Non-autoregressive image captioning with continuous iterative refinement can achieve comparable performance to the autoregressive counterparts with a considerable acceleration.
We propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC) to make a better trade-off between performance and speed.
arXiv Detail & Related papers (2021-10-11T15:11:54Z) - The Power of Triply Complementary Priors for Image Compressive Sensing [89.14144796591685]
We propose a joint low-rank and deep (LRD) image model, which contains a pair of triply complementary priors.
We then propose a novel hybrid plug-and-play framework based on the LRD model for image CS.
To make the optimization tractable, a simple yet effective algorithm is proposed to solve the resulting hybrid plug-and-play image CS problem.
arXiv Detail & Related papers (2020-05-16T08:17:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.