Reverse Stable Diffusion: What prompt was used to generate this image?
- URL: http://arxiv.org/abs/2308.01472v1
- Date: Wed, 2 Aug 2023 23:39:29 GMT
- Title: Reverse Stable Diffusion: What prompt was used to generate this image?
- Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
- Abstract summary: We introduce the new task of predicting the text prompt given an image generated by a generative diffusion model.
We propose a novel learning framework comprising of a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
- Score: 80.82832715884597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models such as Stable Diffusion have recently
attracted the interest of many researchers, and inverting the diffusion process
can play an important role in better understanding the generative process and
how to engineer prompts in order to obtain the desired images. To this end, we
introduce the new task of predicting the text prompt given an image generated
by a generative diffusion model. We combine a series of white-box and black-box
models (with and without access to the weights of the diffusion network) to
deal with the proposed task. We propose a novel learning framework comprising
of a joint prompt regression and multi-label vocabulary classification
objective that generates improved prompts. To further improve our method, we
employ a curriculum learning procedure that promotes the learning of
image-prompt pairs with lower labeling noise (i.e. that are better aligned),
and an unsupervised domain-adaptive kernel learning method that uses the
similarities between samples in the source and target domains as extra
features. We conduct experiments on the DiffusionDB data set, predicting text
prompts from images generated by Stable Diffusion. Our novel learning framework
produces excellent results on the aforementioned task, yielding the highest
gains when applied on the white-box model. In addition, we make an interesting
discovery: training a diffusion model on the prompt generation task can make
the model generate images that are much better aligned with the input prompts,
when the model is directly reused for text-to-image generation.
Related papers
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention
Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
arXiv Detail & Related papers (2024-03-11T02:18:27Z) - Seek for Incantations: Towards Accurate Text-to-Image Diffusion
Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z) - Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z) - SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with
Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z) - In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.