SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data
- URL: http://arxiv.org/abs/2403.06952v1
- Date: Mon, 11 Mar 2024 17:35:33 GMT
- Title: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data
- Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
- Abstract summary: SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
- Score: 73.23388142296535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive
capabilities in creating images from text descriptions. However, these T2I
generation models often fall short of generating images that precisely match
the details of the text inputs, such as incorrect spatial relationships or
missing objects. In this paper, we introduce SELMA: Skill-Specific Expert
Learning and Merging with Auto-Generated Data, a novel paradigm to improve the
faithfulness of T2I models by fine-tuning models on automatically generated,
multi-skill image-text datasets, with skill-specific expert learning and
merging. First, SELMA leverages an LLM's in-context learning capability to
generate multiple datasets of text prompts that can teach different skills, and
then generates the images with a T2I model based on the prompts. Next, SELMA
adapts the T2I model to the new skills by learning multiple single-skill LoRA
(low-rank adaptation) experts followed by expert merging. Our independent
expert fine-tuning specializes multiple models for different skills, and expert
merging helps build a joint multi-skill T2I model that can generate faithful
images given diverse text prompts, while mitigating the knowledge conflict from
different datasets. We empirically demonstrate that SELMA significantly
improves the semantic alignment and text faithfulness of state-of-the-art T2I
diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human
preference metrics (PickScore, ImageReward, and HPS), as well as human
evaluation. Moreover, fine-tuning with image-text pairs auto-collected via
SELMA shows comparable performance to fine-tuning with ground truth data.
Lastly, we show that fine-tuning with images from a weaker T2I model can help
improve the generation quality of a stronger T2I model, suggesting promising
weak-to-strong generalization in T2I models.
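The core recipe in the abstract (independent per-skill LoRA experts, then merging into one multi-skill model) can be pictured with a short sketch. The snippet below is not the authors' code; the skill names, rank, and uniform averaging of the low-rank updates are illustrative assumptions about one simple way such a merge could work:

```python
# Minimal sketch of merging skill-specific LoRA experts into one base weight.
# Skill names, rank, and uniform averaging are illustrative assumptions,
# not necessarily the paper's exact merging scheme.
import torch

def lora_delta(A: torch.Tensor, B: torch.Tensor, alpha: float, rank: int) -> torch.Tensor:
    """Standard LoRA weight update: (alpha / rank) * B @ A."""
    return (alpha / rank) * (B @ A)

d_out, d_in, rank, alpha = 64, 64, 4, 8
base_weight = torch.randn(d_out, d_in)  # a frozen weight matrix of the T2I model

# One independently trained LoRA expert per skill (random tensors stand in
# for trained adapters here).
skills = ["counting", "spatial_relations", "attribute_binding"]
experts = {
    s: (torch.randn(rank, d_in) * 0.01,   # A: down-projection
        torch.randn(d_out, rank) * 0.01)  # B: up-projection
    for s in skills
}

# Merge the experts by averaging their weight updates, then fold the result
# into the base weight to obtain a single multi-skill model.
merged_delta = torch.stack(
    [lora_delta(A, B, alpha, rank) for A, B in experts.values()]
).mean(dim=0)
merged_weight = base_weight + merged_delta
print(merged_weight.shape)  # torch.Size([64, 64])
```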
Related papers
- Text-to-Image Synthesis: A Decade Survey [7.250878248686215]
Text-to-image synthesis (T2I) focuses on generating high-quality images from textual descriptions.
In this survey, we review over 440 recent works on T2I.
arXiv Detail & Related papers (2024-11-25T07:40:32Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Evaluating the Generation of Spatial Relations in Text and Image Generative Models [4.281091463408283]
Spatial relations are naturally understood in a visuo-spatial manner.
We develop an approach to convert LLM outputs into an image, thereby allowing us to evaluate both T2I models and LLMs.
Surprisingly, we found that T2I models only achieve subpar performance despite their impressive general image-generation abilities.
arXiv Detail & Related papers (2024-11-12T09:30:02Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- VersaT2I: Improving Text-to-Image Models with Versatile Reward [32.30564849001593]
VersaT2I is a versatile training framework that can boost the performance of any text-to-image (T2I) model.
We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc.
arXiv Detail & Related papers (2024-03-27T12:08:41Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in communicating effectively with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems endeavoring to align human intent and introduce a new task - interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.