SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data
- URL: http://arxiv.org/abs/2403.06952v1
- Date: Mon, 11 Mar 2024 17:35:33 GMT
- Title: SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with
Auto-Generated Data
- Authors: Jialu Li, Jaemin Cho, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
- Abstract summary: SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
- Score: 73.23388142296535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent text-to-image (T2I) generation models have demonstrated impressive
capabilities in creating images from text descriptions. However, these T2I
generation models often fall short of generating images that precisely match
the details of the text inputs, such as incorrect spatial relationships or
missing objects. In this paper, we introduce SELMA: Skill-Specific Expert
Learning and Merging with Auto-Generated Data, a novel paradigm to improve the
faithfulness of T2I models by fine-tuning models on automatically generated,
multi-skill image-text datasets, with skill-specific expert learning and
merging. First, SELMA leverages an LLM's in-context learning capability to
generate multiple datasets of text prompts that can teach different skills, and
then generates the images with a T2I model based on the prompts. Next, SELMA
adapts the T2I model to the new skills by learning multiple single-skill LoRA
(low-rank adaptation) experts followed by expert merging. Our independent
expert fine-tuning specializes multiple models for different skills, and expert
merging helps build a joint multi-skill T2I model that can generate faithful
images given diverse text prompts, while mitigating the knowledge conflict from
different datasets. We empirically demonstrate that SELMA significantly
improves the semantic alignment and text faithfulness of state-of-the-art T2I
diffusion models on multiple benchmarks (+2.1% on TIFA and +6.9% on DSG), human
preference metrics (PickScore, ImageReward, and HPS), as well as human
evaluation. Moreover, fine-tuning with image-text pairs auto-collected via
SELMA shows comparable performance to fine-tuning with ground truth data.
Lastly, we show that fine-tuning with images from a weaker T2I model can help
improve the generation quality of a stronger T2I model, suggesting promising
weak-to-strong generalization in T2I models.
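The expert learning and merging stage described above can be pictured with a short sketch. This is a minimal illustration under an explicit assumption, not the paper's released code: it supposes the single-skill LoRA experts are merged by a simple weighted average of their parameters, and every name here (merge_lora_experts, the checkpoint file names, the example skills) is hypothetical.

```python
# Minimal sketch of the expert-merging step described in the abstract.
# Assumption (not from the paper's code): experts are merged by a weighted
# average of LoRA parameters; all experts share the same LoRA architecture,
# so their state dicts have identical keys.

from typing import Dict, List, Optional

import torch


def merge_lora_experts(
    expert_state_dicts: List[Dict[str, torch.Tensor]],
    weights: Optional[List[float]] = None,
) -> Dict[str, torch.Tensor]:
    """Merge several single-skill LoRA experts into one multi-skill adapter."""
    if weights is None:
        # Default: uniform averaging over experts.
        weights = [1.0 / len(expert_state_dicts)] * len(expert_state_dicts)
    merged: Dict[str, torch.Tensor] = {}
    for key in expert_state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, expert_state_dicts))
    return merged


# Illustrative usage: each expert was fine-tuned on one auto-generated,
# skill-specific image-text dataset (e.g., counting, spatial relations).
# experts = [torch.load(p) for p in ["lora_counting.pt", "lora_spatial.pt"]]
# multi_skill_lora = merge_lora_experts(experts)
```

Per the abstract, the point of training the experts independently and merging them afterwards, rather than fine-tuning one model on all auto-generated datasets jointly, is to mitigate knowledge conflict between the skill-specific datasets while still producing a single multi-skill model.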
Related papers
- TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation [22.782099757385804]
TIP-I2V is the first large-scale dataset of user-provided text and image prompts for image-to-video generation.
We provide the corresponding generated videos from five state-of-the-art image-to-video models.
arXiv Detail & Related papers (2024-11-05T18:52:43Z)
- VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models [18.259733507395634]
We introduce a new metric called Visual Language Evaluation Understudy (VLEU).
VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model (a generic sketch of this computation is shown after the related-papers list).
Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models.
arXiv Detail & Related papers (2024-09-23T04:50:36Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- ANCHOR: LLM-driven News Subject Conditioning for Text-to-Image Synthesis [6.066100464517522]
We introduce the Abstractive News Captions with High-level cOntext Representation dataset, containing 70K+ samples sourced from 5 different news media organizations.
Our proposed method, Subject-Aware Finetuning (SAFE), selects and enhances the representation of key subjects in synthesized images by leveraging LLM-generated subject weights.
It also adapts to the domain distribution of news images and captions through custom Domain Fine-tuning, outperforming current T2I baselines on ANCHOR.
arXiv Detail & Related papers (2024-04-15T21:19:10Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- VersaT2I: Improving Text-to-Image Models with Versatile Reward [32.30564849001593]
VersaT2I is a versatile training framework that can boost the performance of any text-to-image (T2I) model.
We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc.
arXiv Detail & Related papers (2024-03-27T12:08:41Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It delves into the transference of the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effectively communicating with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems endeavoring to align with human intent and introduce a new task, interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z)
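For the VLEU entry above, the following is a generic sketch of the kind of Kullback-Leibler divergence computation its summary describes. It assumes both distributions are already available as discrete probability vectors over a shared support; it illustrates the shape of such a computation and is not the paper's implementation (function and variable names are hypothetical).

```python
# Generic KL-divergence sketch for two discrete distributions, e.g., a
# prompt-side (marginal) distribution and a distribution induced by the
# generated images (conditional). Not the paper's code.

import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) for discrete probability vectors; eps guards against log(0)."""
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))


# Illustrative usage with made-up probability vectors: a smaller divergence
# between the two distributions would indicate better generalization under
# a metric of this form.
marginal_text = np.array([0.25, 0.25, 0.25, 0.25])
conditional_images = np.array([0.40, 0.30, 0.20, 0.10])
print(kl_divergence(marginal_text, conditional_images))
```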