Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack
- URL: http://arxiv.org/abs/2309.15807v1
- Date: Wed, 27 Sep 2023 17:30:19 GMT
- Title: Emu: Enhancing Image Generation Models Using Photogenic Needles in a
Haystack
- Authors: Xiaoliang Dai, Ji Hou, Chih-Yao Ma, Sam Tsai, Jialiang Wang, Rui Wang,
Peizhao Zhang, Simon Vandenhende, Xiaofang Wang, Abhimanyu Dubey, Matthew Yu,
Abhishek Kadian, Filip Radenovic, Dhruv Mahajan, Kunpeng Li, Yue Zhao, Vladan
Petrovic, Mitesh Kumar Singh, Simran Motwani, Yi Wen, Yiwen Song, Roshan
Sumbaly, Vignesh Ramanathan, Zijian He, Peter Vajda, Devi Parikh
- Abstract summary: Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
- Score: 75.00066365801993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training text-to-image models with web scale image-text pairs enables the
generation of a wide range of visual concepts from text. However, these
pre-trained models often face challenges when it comes to generating highly
aesthetic images. This creates the need for aesthetic alignment post
pre-training. In this paper, we propose quality-tuning to effectively guide a
pre-trained model to exclusively generate highly visually appealing images,
while maintaining generality across visual concepts. Our key insight is that
supervised fine-tuning with a surprisingly small set of extremely visually appealing images can significantly improve the generation quality. We pre-train
a latent diffusion model on $1.1$ billion image-text pairs and fine-tune it
with only a few thousand carefully selected high-quality images. The resulting
model, Emu, achieves a win rate of $82.9\%$ against its pre-trained-only counterpart. Compared with the state-of-the-art SDXL v1.0, Emu is preferred $68.4\%$ and $71.3\%$ of the time on visual appeal on the standard PartiPrompts benchmark and on our Open User Input benchmark, which is based on real-world usage of text-to-image models. In addition, we show that quality-tuning is a generic
approach that is also effective for other architectures, including pixel
diffusion and masked generative transformer models.
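For illustration only, the sketch below shows what one supervised quality-tuning step could look like for a latent diffusion model in PyTorch with the diffusers library. The base checkpoint, learning rate, and the omitted dataloader over a few thousand hand-curated, highly aesthetic image-text pairs are assumptions made for this sketch, not the authors' released training code.

```python
# Minimal quality-tuning sketch: supervised fine-tuning of a pre-trained latent
# diffusion model on a small, manually curated set of highly aesthetic
# image-text pairs. Checkpoint name and hyperparameters are assumptions.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

base = "CompVis/stable-diffusion-v1-4"  # assumed stand-in for the 1.1B-pair pre-trained model
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").eval()
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder").eval()
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)  # small LR: stay close to pre-training

def quality_tune_step(pixel_values, captions):
    """One step on hand-picked pairs. pixel_values: (B, 3, H, W) in [-1, 1]; captions: list of str."""
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
        tokens = tokenizer(captions, padding="max_length",
                           max_length=tokenizer.model_max_length,
                           truncation=True, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids)[0]
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)
    pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample
    loss = torch.nn.functional.mse_loss(pred, noise)  # standard noise-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```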
Related papers
- PrefPaint: Aligning Image Inpainting Diffusion Model with Human Preference [62.72779589895124]
We make the first attempt to align diffusion models for image inpainting with human aesthetic standards via a reinforcement learning framework.
We train a reward model with a dataset we construct, consisting of nearly 51,000 images annotated with human preferences.
Experiments on inpainting comparison and downstream tasks, such as image extension and 3D reconstruction, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-29T11:49:39Z) - Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation [52.509092010267665]
We introduce LlamaGen, a new family of image generation models that applies the original "next-token prediction" paradigm of large language models to the visual generation domain.
It answers in the affirmative whether vanilla autoregressive models such as Llama, with no inductive biases for visual signals, can achieve state-of-the-art image generation performance when scaled properly.
arXiv Detail & Related papers (2024-06-10T17:59:52Z) - Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
- Improving face generation quality and prompt following with synthetic captions [57.47448046728439]
We introduce a training-free pipeline designed to generate accurate appearance descriptions from images of people.
We then use these synthetic captions to fine-tune a text-to-image diffusion model.
Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces.
arXiv Detail & Related papers (2024-05-17T15:50:53Z) - Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - Implementing and Experimenting with Diffusion Models for Text-to-Image
Generation [0.0]
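A generic rendering of that recipe, fitting the reference images while penalizing drift from the frozen pretrained denoiser, is sketched below with toy modules; the penalty weight and placeholder networks are assumptions, and this simplified loss is a stand-in rather than the paper's exact Direct Consistency Optimization objective.

```python
# Simplified sketch: fine-tune a denoiser toward reference images while
# penalizing deviation from the frozen pretrained denoiser. A generic
# stand-in for "consistency + deviation penalty", not the paper's objective.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):                     # placeholder for a UNet
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(64, 4, 3, padding=1))
    def forward(self, x, t):                       # t kept for interface parity; unused here
        return self.net(x)

pretrained = TinyDenoiser().eval()                 # frozen reference model
for p in pretrained.parameters():
    p.requires_grad_(False)
finetuned = TinyDenoiser()
finetuned.load_state_dict(pretrained.state_dict())  # start from the pretrained weights

lam = 0.1                                          # deviation-penalty weight (assumed)
opt = torch.optim.AdamW(finetuned.parameters(), lr=1e-5)

ref_latents = torch.randn(2, 4, 32, 32)            # stand-in for encoded reference images
noise = torch.randn_like(ref_latents)
t = torch.randint(0, 1000, (2,))
noisy = ref_latents + noise                        # crude noising stand-in

pred = finetuned(noisy, t)
with torch.no_grad():
    pred_pre = pretrained(noisy, t)

consistency = nn.functional.mse_loss(pred, noise)      # fit the reference images
deviation = nn.functional.mse_loss(pred, pred_pre)     # stay close to the pretrained model
loss = consistency + lam * deviation
loss.backward(); opt.step()
```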
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images can be generated from a simple textual description.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z) - LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work to train text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)