If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based
  Text-to-Image Generation by Selection
- URL: http://arxiv.org/abs/2305.13308v1
- Date: Mon, 22 May 2023 17:59:41 GMT
- Title: If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based
  Text-to-Image Generation by Selection
- Authors: Shyamgopal Karthik, Karsten Roth, Massimiliano Mancini, Zeynep Akata
- Abstract summary: Diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt.
We show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts.
We introduce a pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system.
- Score: 53.320946030761796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Despite their impressive capabilities, diffusion-based text-to-image (T2I)
models can lack faithfulness to the text prompt, where generated images may not
contain all the mentioned objects, attributes or relations. To alleviate these
issues, recent works proposed post-hoc methods to improve model faithfulness
without costly retraining, by modifying how the model utilizes the input
prompt. In this work, we take a step back and show that large T2I diffusion
models are more faithful than usually assumed, and can generate images faithful
to even complex prompts without the need to manipulate the generative process.
Based on that, we show how faithfulness can be simply treated as a candidate
selection problem instead, and introduce a straightforward pipeline that
generates candidate images for a text prompt and picks the best one according
to an automatic scoring system that can leverage already existing T2I
evaluation metrics. Quantitative comparisons alongside user studies on diverse
benchmarks show consistently improved faithfulness over post-hoc enhancement
methods, with comparable or lower computational cost. Code is available at
https://github.com/ExplainableML/ImageSelect.
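
The candidate-selection idea described in the abstract can be sketched as a best-of-N loop: sample several images for one prompt, score each against the prompt, and keep the highest-scoring one. The sketch below is only an illustration under stated assumptions. The names `select_best`, `toy_generate`, and `toy_score` are hypothetical and are not taken from the ImageSelect repository; the toy generator and scorer are stand-ins so the example runs without any model. In practice, `generate` would presumably wrap a T2I diffusion model and `score` an existing T2I faithfulness metric.

```python
import random
from typing import Any, Callable, Tuple


def select_best(
    prompt: str,
    generate: Callable[[str, int], Any],
    score: Callable[[str, Any], float],
    num_candidates: int = 4,
    seed: int = 0,
) -> Tuple[Any, float, int]:
    """Generate num_candidates images and return (best image, its score, its index)."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # One candidate per seed; the generative process itself is left untouched.
    candidates = [generate(prompt, seed + i) for i in range(num_candidates)]
    scores = [score(prompt, image) for image in candidates]
    # Pick the candidate with the highest faithfulness score.
    best_idx = max(range(num_candidates), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx], best_idx


# --- Toy stand-ins (assumptions, for illustration only) ---

def toy_generate(prompt: str, seed: int) -> frozenset:
    """Pretend image: the subset of prompt words that 'made it' into the image."""
    rng = random.Random(seed)
    return frozenset(word for word in prompt.split() if rng.random() < 0.6)


def toy_score(prompt: str, image: frozenset) -> float:
    """Pretend metric: fraction of distinct prompt words present in the image."""
    words = set(prompt.split())
    return len(words & image) / len(words)


if __name__ == "__main__":
    best, best_score, idx = select_best(
        "a red cube on a blue sphere", toy_generate, toy_score, num_candidates=8
    )
    print(f"picked candidate {idx} with score {best_score:.2f}: {sorted(best)}")
```

Because the generator and scorer are passed in as plain callables, the same loop works with any sampler and any scoring metric; the cost scales linearly with the number of candidates.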
 
      
Related papers
        - Aligning Text to Image in Diffusion Models is Easier Than You Think [47.623236425067326]
 We introduce a lightweight contrastive fine-tuning strategy called SoftREPA that uses soft text tokens.
Our method explicitly increases the mutual information between text and image representations, leading to enhanced semantic consistency.
 arXiv (2025-03-11T10:14:22Z)
- One-Prompt-One-Story: Free-Lunch Consistent Text-to-Image Generation Using a Single Prompt [101.17660804110409]
 Text-to-image generation models can create high-quality images from input prompts.
They struggle to support the consistent generation of identity-preserving requirements for storytelling.
We propose a novel training-free method for consistent text-to-image generation.
 arXiv (2025-01-23T10:57:22Z)
- Origin Identification for Text-Guided Image-to-Image Diffusion Models [39.234894330025114]
 We propose origin IDentification for text-guided Image-to-image Diffusion models (ID$^2$).
A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images.
To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method.
 arXiv (2025-01-04T20:34:53Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
 We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to the dynamic fine-control prompts.
 arXiv (2024-04-05T13:44:39Z)
- Regeneration Based Training-free Attribution of Fake Images Generated by
  Text-to-Image Generative Models [39.33821502730661]
 We present a training-free method to attribute fake images generated by text-to-image models to their source models.
By calculating and ranking the similarity of the test image and the candidate images, we can determine the source of the image.
 arXiv (2024-03-03T11:55:49Z)
- Direct Consistency Optimization for Compositional Text-to-Image
  Personalization [73.94505688626651]
 Text-to-image (T2I) diffusion models, when fine-tuned on a few personal images, are able to generate visuals with a high degree of consistency.
We propose to fine-tune the T2I model by maximizing consistency to reference images, while penalizing the deviation from the pretrained model.
 arXiv (2024-02-19T09:52:41Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
 We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
 arXiv (2023-08-02T23:39:29Z)
- Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion
  Models [94.25020178662392]
 Text-to-image (T2I) research has grown explosively in the past year.
One pain point persists: the text prompt engineering, and searching high-quality text prompts for customized results is more art than science.
In this paper, we take "Text" out of a pre-trained T2I diffusion model, to reduce the burdensome prompt engineering efforts for users.
 arXiv (2023-05-25T16:30:07Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image
  Diffusion Models with Large Language Models [62.75006608940132]
 This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
 arXiv (2023-05-23T03:59:06Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with
  Text-Guided Diffusion Models [16.786221846896108]
 We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples.
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
 arXiv (2021-12-20T18:42:55Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
 Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
 arXiv (2021-11-29T11:01:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     