If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based
Text-to-Image Generation by Selection
- URL: http://arxiv.org/abs/2305.13308v1
- Date: Mon, 22 May 2023 17:59:41 GMT
- Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
- Abstract summary: Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
- Score: 53.320946030761796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their impressive capabilities, diffusion-based text-to-image (T2I)
models can lack faithfulness to the text prompt, where generated images may not
contain all the mentioned objects, attributes or relations. To alleviate these
issues, recent works proposed post-hoc methods to improve model faithfulness
without costly retraining, by modifying how the model utilizes the input
prompt. In this work, we take a step back and show that large T2I diffusion
models are more faithful than usually assumed, and can generate images faithful
to even complex prompts without the need to manipulate the generative process.
Based on that, we show how faithfulness can be simply treated as a candidate
selection problem instead, and introduce a straightforward pipeline that
generates candidate images for a text prompt and picks the best one according
to an automatic scoring system that can leverage already existing T2I
evaluation metrics. Quantitative comparisons alongside user studies on diverse
benchmarks show consistently improved faithfulness over post-hoc enhancement
methods, with comparable or lower computational cost. Code is available at
https://github.com/ExplainableML/ImageSelect.
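The selection pipeline described in the abstract (generate several candidate images for a prompt, score each one, keep the best) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation and not the API of the ImageSelect repository: `select_best_candidate`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and scorer are stand-ins for a real T2I diffusion model and a real faithfulness metric.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def select_best_candidate(
    prompt: str,
    generate: Callable[[str, int], Image],  # hypothetical: (prompt, seed) -> image
    score: Callable[[str, Image], float],   # hypothetical: (prompt, image) -> faithfulness
    seeds: Sequence[int] = (0, 1, 2, 3),
) -> Tuple[Image, float, List[float]]:
    """Generate one candidate per seed and return the highest-scoring one."""
    if not seeds:
        raise ValueError("need at least one seed")
    candidates = [generate(prompt, seed) for seed in seeds]
    scores = [score(prompt, image) for image in candidates]
    # max() returns the first maximum, so ties go to the earliest seed.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index], scores


# Toy stand-ins (assumptions for illustration only): an "image" is the list
# of prompt words it depicts, and each seed drops a different subset of words.
def toy_generate(prompt: str, seed: int) -> List[str]:
    words = prompt.split()
    return [w for i, w in enumerate(words) if (i + seed) % 3 != 0]


def toy_score(prompt: str, image: List[str]) -> float:
    # Fraction of prompt words that appear in the toy image.
    words = prompt.split()
    return sum(w in image for w in words) / len(words)


best_image, best_score, all_scores = select_best_candidate(
    "red cube blue sphere", toy_generate, toy_score
)
print(best_image, best_score, all_scores)
```

In a real setting, `generate` would wrap a diffusion model sampled with different random seeds and `score` would wrap an existing T2I evaluation metric, as the abstract suggests; the selection step itself does not change the generative process.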
Related papers
- Regeneration Based Training-free Attribution of Fake Images Generated by
Text-to-Image Generative Models [39.33821502730661]
We present a training-free method to attribute fake images generated by text-to-image models to their source models.
By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image.
arXiv Detail & Related papers (2024-03-03T11:55:49Z)
- Direct Consistency Optimization for Compositional Text-to-Image
Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
- Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z)
- Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion
Models [94.25020178662392]
Text-to-image (T2I) research has grown explosively in the past year.
One pain point persists: text prompt engineering. Searching for high-quality text prompts that give customized results is more art than science.
In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
arXiv Detail & Related papers (2023-05-25T16:30:07Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)