DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
- URL: http://arxiv.org/abs/2311.17946v1
- Date: Wed, 29 Nov 2023 03:42:16 GMT
- Title: DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback
- Authors: Jiao Sun, Deqing Fu, Yushi Hu, Su Wang, Royi Rassin, Da-Cheng Juan,
Dana Alon, Charles Herrmann, Sjoerd van Steenkiste, Ranjay Krishna, Cyrus
Rashtchian
- Abstract summary: Text-to-Image models (T2I) still struggle to produce images that are both aesthetically pleasing and faithful to the user's input text.
We introduce DreamSync, a training algorithm that is model-agnostic by design and improves the faithfulness of T2I models to the input text.
Despite its simplicity, DreamSync improves both the semantic alignment and aesthetic appeal of two diffusion-based T2I models.
- Score: 38.81701138951801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite their widespread success, Text-to-Image (T2I) models still
struggle to produce images that are both aesthetically pleasing and faithful to
the user's input text. We introduce DreamSync, a training algorithm that is
model-agnostic by design and improves the faithfulness of T2I models to the
input text. DreamSync
builds off a recent insight from TIFA's evaluation framework -- that large
vision-language models (VLMs) can effectively identify the fine-grained
discrepancies between generated images and the text inputs. DreamSync uses this
insight to train T2I models without any labeled data; it improves T2I models
using its own generations. First, it prompts the model to generate several
candidate images for a given input text. Then, it uses two VLMs to select the
best generation: a Visual Question Answering model that measures the alignment
of generated images to the text, and another that measures the generation's
aesthetic quality. After selection, we use LoRA to iteratively finetune the T2I
model to guide its generation towards the selected best generations. DreamSync
does not need any additional human annotation, model architecture changes, or
reinforcement learning. Despite its simplicity, DreamSync improves both the
semantic alignment and aesthetic appeal of two diffusion-based T2I models,
evidenced by multiple benchmarks (+1.7% on TIFA, +2.9% on DSG1K, +3.4% on VILA
aesthetic) and human evaluation.
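As a rough illustration of the generate-select-finetune loop described in the abstract, here is a minimal Python sketch. The callables `generate`, `alignment_score`, `aesthetic_score`, and `lora_finetune`, as well as the candidate count and alignment threshold, are assumptions made for this sketch and are not taken from the paper or any released code.

```python
# Minimal sketch of the DreamSync loop: generate candidates, filter with a
# VQA-based alignment scorer, pick the most aesthetic survivor, and collect
# (prompt, image) pairs for LoRA finetuning. All callables are hypothetical
# placeholders, not a released API.
from typing import Any, Callable, List, Tuple

def dreamsync_select(
    prompts: List[str],
    generate: Callable[[str, int], List[Any]],     # T2I model: (prompt, n) -> images
    alignment_score: Callable[[Any, str], float],  # VLM 1: VQA-based text faithfulness
    aesthetic_score: Callable[[Any], float],       # VLM 2: aesthetic quality
    n_candidates: int = 8,
    alignment_threshold: float = 0.9,
) -> List[Tuple[str, Any]]:
    """Return (prompt, best image) pairs to finetune the T2I model on."""
    selected = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        # Keep candidates the VQA model judges faithful to the prompt ...
        faithful = [im for im in candidates
                    if alignment_score(im, prompt) >= alignment_threshold]
        # ... then keep the most aesthetically pleasing one among them.
        if faithful:
            selected.append((prompt, max(faithful, key=aesthetic_score)))
    return selected

def dreamsync(prompts, generate, alignment_score, aesthetic_score,
              lora_finetune, rounds: int = 3):
    """Iterative self-training: finetune the T2I model on its own best generations."""
    for _ in range(rounds):
        pairs = dreamsync_select(prompts, generate, alignment_score, aesthetic_score)
        generate = lora_finetune(generate, pairs)  # LoRA update toward selected images
    return generate
```

The key design point this sketch tries to capture is that no human labels are needed: the VLM scorers act as the only supervision signal, and the model is repeatedly pulled toward its own best-scoring outputs.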
Related papers
- Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative Framework [3.7953598825170753]
Kandinsky 3 is a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism.
We extend the base T2I model for various applications and create a multifunctional generation system.
Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
arXiv Detail & Related papers (2024-10-28T14:22:08Z)
- Still-Moving: Customized Video Generation without Customized Video Data [81.09302547183155]
We introduce Still-Moving, a novel framework for customizing a text-to-video (T2V) model.
The framework applies to the prominent T2V design where the video model is built over a text-to-image (T2I) model.
We train lightweight Spatial Adapters that adjust the features produced by the injected T2I layers.
arXiv Detail & Related papers (2024-07-11T17:06:53Z)
- VersaT2I: Improving Text-to-Image Models with Versatile Reward [32.30564849001593]
VersaT2I is a versatile training framework that can boost the performance of any text-to-image (T2I) model.
We decompose the quality of the image into several aspects such as aesthetics, text-image alignment, geometry, low-level quality, etc.
arXiv Detail & Related papers (2024-03-27T12:08:41Z)
- SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data [73.23388142296535]
SELMA improves the faithfulness of T2I models by fine-tuning models on automatically generated, multi-skill image-text datasets.
We show that SELMA significantly improves the semantic alignment and text faithfulness of state-of-the-art T2I diffusion models on multiple benchmarks.
We also show that fine-tuning with image-text pairs auto-collected via SELMA shows comparable performance to fine-tuning with ground truth data.
arXiv Detail & Related papers (2024-03-11T17:35:33Z)
- Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the text encoder and image decoder in current T2I models makes it challenging to replace or upgrade either component.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model.
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive video datasets for training.
We propose Tune-A-Video, which is capable of producing temporally-coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)