Emage: Non-Autoregressive Text-to-Image Generation
- URL: http://arxiv.org/abs/2312.14988v1
- Date: Fri, 22 Dec 2023 10:01:54 GMT
- Title: Emage: Non-Autoregressive Text-to-Image Generation
- Authors: Zhangyin Feng, Runyi Hu, Liangxin Liu, Fan Zhang, Duyu Tang, Yong Dai,
Xiaocheng Feng, Jiwei Li, Bing Qin, Shuming Shi
- Abstract summary: Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
- Score: 63.347052548210236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive and diffusion models drive the recent breakthroughs on
text-to-image generation. Despite their huge success of generating
high-realistic images, a common shortcoming of these models is their high
inference latency - autoregressive models run more than a thousand times
successively to produce image tokens and diffusion models convert Gaussian
noise into images with many hundreds of denoising steps. In this work, we
explore non-autoregressive text-to-image models that efficiently generate
hundreds of image tokens in parallel. We develop many model variations with
different learning and inference strategies, initialized text encoders, etc.
Compared with autoregressive baselines that needs to run one thousand times,
our model only runs 16 times to generate images of competitive quality with an
order of magnitude lower inference latency. Our non-autoregressive model with
346M parameters generates an image of 256$\times$256 with about one second on
one V100 GPU.
Related papers
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework.
Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss.
We show that MMAR demonstrates much more superior performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z) - Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD), to accelerate auto-regressive text-to-image generation.
By introducing a probabilistic convergence criterion, our SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.
arXiv Detail & Related papers (2024-10-02T16:05:27Z) - Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that apply original next-token prediction'' paradigm of large language models to visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z) - RL for Consistency Models: Faster Reward Guided Text-to-Image Generation [15.238373471473645]
We propose a framework for fine-tuning consistency models viaReinforcement Learning (RL)
Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure.
Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps.
arXiv Detail & Related papers (2024-03-25T15:40:22Z) - Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z) - Non-Autoregressive Image Captioning with Counterfactuals-Critical
Multi-Agent Learning [46.060954649681385]
We propose a Non-Autoregressive Image Captioning model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL)
Our NAIC model achieves a performance comparable to state-of-the-art autoregressive models, while brings 13.9x decoding speedup.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.