Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
- URL: http://arxiv.org/abs/2406.06525v1
- Date: Mon, 10 Jun 2024 17:59:52 GMT
- Title: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
- Authors: Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan,
- Abstract summary: We introduce LlamaGen, a new family of image generation models that apply original next-token prediction'' paradigm of large language models to visual generation domain.
It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly.
- Score: 52.509092010267665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce LlamaGen, a new family of image generation models that apply original ``next-token prediction'' paradigm of large language models to visual generation domain. It is an affirmative answer to whether vanilla autoregressive models, e.g., Llama, without inductive biases on visual signals can achieve state-of-the-art image generation performance if scaling properly. We reexamine design spaces of image tokenizers, scalability properties of image generation models, and their training data quality. The outcome of this exploration consists of: (1) An image tokenizer with downsample ratio of 16, reconstruction quality of 0.94 rFID and codebook usage of 97% on ImageNet benchmark. (2) A series of class-conditional image generation models ranging from 111M to 3.1B parameters, achieving 2.18 FID on ImageNet 256x256 benchmarks, outperforming the popular diffusion models such as LDM, DiT. (3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment. (4) We verify the effectiveness of LLM serving frameworks in optimizing the inference speed of image generation models and achieve 326% - 414% speedup. We release all models and codes to facilitate open-source community of visual generation and multimodal foundation models.
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which non-autoregressive masked image modeling (MIM) text-to-image elevates to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z) - RL for Consistency Models: Faster Reward Guided Text-to-Image Generation [15.238373471473645]
We propose a framework for fine-tuning consistency models viaReinforcement Learning (RL)
Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure.
Comparing to RL finetuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high quality images with as few as two inference steps.
arXiv Detail & Related papers (2024-03-25T15:40:22Z) - Active Generation for Image Classification [45.93535669217115]
We propose to address the efficiency of image generation by focusing on the specific needs and characteristics of the model.
With a central tenet of active learning, our method, named ActGen, takes a training-aware approach to image generation.
arXiv Detail & Related papers (2024-03-11T08:45:31Z) - Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z) - Emu: Enhancing Image Generation Models Using Photogenic Needles in a
Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z) - Synthetic Data from Diffusion Models Improves ImageNet Classification [47.999055841125156]
Large-scale text-to image diffusion models can be fine-tuned to produce class conditional models.
Augmenting the ImageNet training set with samples from the resulting models yields significant improvements in ImageNet classification accuracy.
arXiv Detail & Related papers (2023-04-17T17:42:29Z) - Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z) - Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework, where a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.