Related papers: If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

URL: http://arxiv.org/abs/2305.13308v1
Date: Mon, 22 May 2023 17:59:41 GMT
Title: If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection
Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
Abstract summary: Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt. We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts. We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
Score: 53.320946030761796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite their impressive capabilities, diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt, where generated images may not contain all the mentioned objects, attributes or relations. To alleviate these issues, recent works proposed post-hoc methods to improve model faithfulness without costly retraining, by modifying how the model utilizes the input prompt. In this work, we take a step back and show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts without the need to manipulate the generative process. Based on that, we show how faithfulness can be simply treated as a candidate selection problem instead, and introduce a straightforward pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system that can leverage already existing T2I evaluation metrics. Quantitative comparisons alongside user studies on diverse benchmarks show consistently improved faithfulness over post-hoc enhancement methods, with comparable or lower computational cost. Code is available at https://github.com/ExplainableML/ImageSelect.
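The candidate-selection idea in the abstract can be sketched as a best-of-N loop: sample several images for the same prompt with different seeds, score each one against the prompt, and keep the highest-scoring candidate. The sketch below is illustrative only and is not the authors' ImageSelect code. The names `select_best`, `toy_generate`, and `toy_score` are hypothetical, and the toy generator and scorer are stand-ins so the example runs without a diffusion model. In practice `generate` would wrap a T2I model and `score` would wrap an existing T2I evaluation metric.

```python
import random
from typing import Callable, List, Tuple, TypeVar

Image = TypeVar("Image")


def select_best(
    prompt: str,
    generate: Callable[[str, int], Image],
    score: Callable[[str, Image], float],
    num_candidates: int = 10,
    base_seed: int = 0,
) -> Tuple[Image, float, List[float]]:
    """Generate candidates with different seeds and keep the top-scoring one."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # One candidate per seed; the generative process itself is left untouched.
    candidates = [generate(prompt, base_seed + i) for i in range(num_candidates)]
    # Score every candidate against the same prompt.
    scores = [score(prompt, img) for img in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best], scores


# Toy stand-ins (assumptions for illustration, not real models or metrics).
def toy_generate(prompt: str, seed: int) -> List[str]:
    """Pretend 'image': a seed-dependent random subset of the prompt's words."""
    rng = random.Random(seed)
    return [w for w in prompt.split() if rng.random() < 0.6]


def toy_score(prompt: str, image: List[str]) -> float:
    """Pretend faithfulness metric: fraction of prompt words kept in the 'image'."""
    return len(image) / len(prompt.split())


if __name__ == "__main__":
    prompt = "a red cube on a blue sphere"
    image, best_score, all_scores = select_best(
        prompt, toy_generate, toy_score, num_candidates=8
    )
    print("scores:", [round(s, 2) for s in all_scores])
    print("selected:", image, "score:", round(best_score, 2))
```

Because selection only wraps the generator and the scorer, the cost grows linearly with `num_candidates`, and the selected image can never score lower than the first sample under the chosen metric.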

Related papers

Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
We introduce a lightweight contrastive fine tuning strategy called SoftREPA that uses soft text tokens. Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
arXiv Detail & Related papers (2025-03-11T10:14:22Z)
One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
Text-to-image generation models can create high-quality images from input prompts. They struggle to support the consistent generation of identity-preserving requirements for storytelling. We propose a novel training-free method for consistent text-to-image generation.
arXiv Detail & Related papers (2025-01-23T10:57:22Z)
Regeneration Based Training-free Attribution of Fake Images Generated by Text-to-Image Generative Models [39.33821502730661]
We present a training-free method to attribute fake images generated by text-to-image models to their source models. By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image.
arXiv Detail & Related papers (2024-03-03T11:55:49Z)
Direct Consistency Optimization for Compositional Text-to-Image Personalization [73.94505688626651]
Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency. We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
arXiv Detail & Related papers (2024-02-19T09:52:41Z)
Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model. We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective. We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models [94.25020178662392]
Text-to-image (T2I) research has grown explosively in the past year. One pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science. In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
arXiv Detail & Related papers (2023-05-25T16:30:07Z)
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models. Our method leverages a pretrained large language model for grounded generation in a novel two-stage process. Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. In this work, we repurpose such models to generate a descriptive text given an image at inference time. The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.