VersaT2I: Improving Text-to-Image Models with Versatile Reward
- URL: http://arxiv.org/abs/2403.18493v1
- Date: Wed, 27 Mar 2024 12:08:41 GMT
- Title: VersaT2I: Improving Text-to-Image Models with Versatile Reward
- Authors: Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang,
- Abstract summary: VersaT2I is a versatile training framework that can boost the performance of any text-to-image (T2I) model.
We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc.
- Score: 32.30564849001593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that can boost the performance with multiple rewards of any T2I model. We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc. Then, for every quality aspect, we select high-quality images in this aspect generated by the model as the training set to finetune the T2I model using the Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which can avoid conflicts between different quality aspects. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria.
Related papers
- Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework [3.7953598825170753]
Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism.
We extend the base T2I model for various applications and create a multifunctional generation system.
Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
arXiv Detail & Related papers (2024-10-28T14:22:08Z) - Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing texts in source language into an image containing translations in target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models with only 70.9% of parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z) - SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z) - Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z) - DreamSync: Aligning Text-to-Image Generation with Image Understanding
Feedback [38.81701138951801]
Text-to-Image models (T2I) still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text.
We introduce DreamSync, a model-agnostic training algorithm by design that improves T2I models to be faithful to the text input.
Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models.
arXiv Detail & Related papers (2023-11-29T03:42:16Z) - Emu: Enhancing Image Generation Models Using Photogenic Needles in a
Haystack [75.00066365801993]
Training text-to-image models with web scale image-text pairs enables the generation of a wide range of visual concepts from text.
These pre-trained models often face challenges when it comes to generating highly aesthetic images.
We propose quality-tuning to guide a pre-trained model to exclusively generate highly visually appealing images.
arXiv Detail & Related papers (2023-09-27T17:30:19Z) - Towards Unsupervised Deep Image Enhancement with Generative Adversarial
Network [92.01145655155374]
We present an unsupervised image enhancement generative network (UEGAN)
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.