Reverse Stable Diffusion: What prompt was used to generate this image?
- URL: http://arxiv.org/abs/2308.01472v1
- Date: Wed, 2 Aug 2023 23:39:29 GMT
- Title: Reverse Stable Diffusion: What prompt was used to generate this image?
- Authors: Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
- Abstract summary: We introduce the new task of predicting the text prompt given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
- Score: 80.82832715884597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models such as Stable Diffusion have recently
attracted the interest of many researchers, and inverting the diffusion process
can play an important role in better understanding the generative process and
how to engineer prompts in order to obtain the desired images. To this end, we
introduce the new task of predicting the text prompt given an image generated
by a generative diffusion model. We combine a series of white-box and black-box
models (with and without access to the weights of the diffusion network) to
deal with the proposed task. We propose a novel learning framework comprising
a joint prompt regression and multi-label vocabulary classification
objective that generates improved prompts. To further improve our method, we
employ a curriculum learning procedure that promotes the learning of
image-prompt pairs with lower labeling noise (i.e. that are better aligned),
and an unsupervised domain-adaptive kernel learning method that uses the
similarities between samples in the source and target domains as extra
features. We conduct experiments on the DiffusionDB data set, predicting text
prompts from images generated by Stable Diffusion. Our novel learning framework
produces excellent results on the aforementioned task, yielding the highest
gains when applied to the white-box model. In addition, we make an interesting
discovery: training a diffusion model on the prompt generation task can make
the model generate images that are much better aligned with the input prompts,
when the model is directly reused for text-to-image generation.
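To make the joint objective concrete, below is a minimal PyTorch sketch of a head that combines prompt-embedding regression with multi-label vocabulary classification on top of frozen image features. The layer sizes, loss weighting, and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PromptPredictionHead(nn.Module):
    """Sketch of a joint prompt-regression + multi-label vocabulary
    classification head. Assumes image features come from a frozen
    backbone (e.g. a CLIP image encoder); dimensions are illustrative."""

    def __init__(self, feat_dim=768, prompt_embed_dim=768, vocab_size=10000):
        super().__init__()
        # Regression branch: predict a dense embedding of the prompt.
        self.regressor = nn.Linear(feat_dim, prompt_embed_dim)
        # Classification branch: predict which vocabulary words appear in the prompt.
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, image_features):
        return self.regressor(image_features), self.classifier(image_features)

def joint_loss(pred_embed, pred_logits, target_embed, target_words, alpha=1.0):
    """Cosine-distance regression on the prompt embedding plus
    binary cross-entropy over the vocabulary (multi-label)."""
    reg = 1.0 - nn.functional.cosine_similarity(pred_embed, target_embed, dim=-1).mean()
    cls = nn.functional.binary_cross_entropy_with_logits(pred_logits, target_words)
    return reg + alpha * cls

# Usage with random tensors standing in for a batch of 8 images:
head = PromptPredictionHead()
feats = torch.randn(8, 768)                             # frozen image features
target_embed = torch.randn(8, 768)                      # embedding of the true prompt
target_words = torch.randint(0, 2, (8, 10000)).float()  # multi-hot vocabulary labels
pred_embed, pred_logits = head(feats)
loss = joint_loss(pred_embed, pred_logits, target_embed, target_words)
```

In the abstract's setting, the same head could be trained under a curriculum that presents better-aligned image-prompt pairs first, but that ordering is not shown here.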
Related papers
- Prompt Mixing in Diffusion Models using the Black Scholes Algorithm [57.03116054807942]
We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts.
We leverage the connection between diffusion models and the Black-Scholes model for pricing options in Finance.
Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training.
arXiv Detail & Related papers (2024-05-22T14:25:57Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- In-Context Learning Unlocked for Diffusion Models [163.54453915874402]
We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models.
We propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input.
The resulting Prompt Diffusion model is the first diffusion-based vision-language foundation model capable of in-context learning.
arXiv Detail & Related papers (2023-05-01T23:03:37Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [102.88033622546251]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost (a rough routing sketch follows after this list).
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
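As a rough illustration of the expert-denoiser idea behind eDiffi, the sketch below routes each denoising timestep to one member of an ensemble, so exactly one expert runs per step and the inference cost stays flat. The function and variable names are hypothetical; the actual eDiffi split points and training schedule differ.

```python
def route_expert(timestep, experts, boundaries):
    """Pick the expert responsible for a given diffusion timestep.
    boundaries[i] is the smallest timestep still handled by experts[i];
    experts are ordered from high-noise (early) to low-noise (late) stages."""
    for expert, boundary in zip(experts, boundaries):
        if timestep >= boundary:
            return expert
    return experts[-1]

# Usage: two experts over a 1000-step schedule, split at t = 500.
experts = ["expert_high_noise", "expert_low_noise"]  # stand-ins for denoiser networks
boundaries = [500, 0]
assert route_expert(900, experts, boundaries) == "expert_high_noise"
assert route_expert(120, experts, boundaries) == "expert_low_noise"
```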