Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
Diffusion Models
- URL: http://arxiv.org/abs/2312.12416v1
- Date: Tue, 19 Dec 2023 18:47:30 GMT
- Title: Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
Diffusion Models
- Authors: Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal
- Abstract summary: This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
- Score: 46.18013380882767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quality of the prompts provided to text-to-image diffusion models
determines how faithful the generated content is to the user's intent, often
requiring 'prompt engineering'. To harness visual concepts from target images
without prompt engineering, current approaches largely rely on embedding
inversion, optimizing continuous embeddings and then mapping them to
pseudo-tokens. However,
working with such high-dimensional vector representations is challenging
because they lack semantics and interpretability, and only allow simple vector
operations when using them. Instead, this work focuses on inverting the
diffusion model to obtain interpretable language prompts directly. The
challenge of doing this lies in the fact that the resulting optimization
problem is fundamentally discrete and the space of prompts is exponentially
large; this makes using standard optimization techniques, such as stochastic
gradient descent, difficult. To this end, we utilize a delayed projection
scheme to optimize for prompts representative of the vocabulary space in the
model. Further, we leverage the finding that different timesteps of the
diffusion process cater to different levels of detail in an image. The later,
noisier timesteps of the forward diffusion process correspond to semantic
information; prompt inversion in this range therefore provides tokens
representative of the image semantics. We show that our approach can identify
semantically interpretable and meaningful prompts for a target image, which can
be used to synthesize diverse images with similar content. We further
illustrate the application of the optimized prompts in evolutionary image
generation and concept removal.
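As a rough illustration of the idea, below is a minimal PyTorch-style sketch of prompt inversion with a delayed projection scheme, assuming a Stable-Diffusion-like pipeline. The components `text_encoder`, `unet`, `vae_encode`, and `scheduler`, along with all hyperparameters, are stand-ins supplied by the caller (e.g. from a diffusers pipeline), not the authors' implementation: continuous token embeddings are optimized against the denoising loss at late, noisy timesteps only, and are projected onto the nearest vocabulary embeddings just once every few steps.

```python
# Hedged sketch of delayed-projection prompt inversion. All pipeline
# pieces (text_encoder, unet, vae_encode, scheduler) are stand-ins the
# caller supplies; the loop follows the abstract's description, not the
# authors' exact algorithm.
import torch
import torch.nn.functional as F

def invert_prompt(target_image, vocab_embeds, text_encoder, unet,
                  vae_encode, scheduler, num_tokens=8, steps=500,
                  project_every=50, t_min=600, t_max=1000, lr=0.1):
    """Recover token ids whose embeddings explain the target image."""
    d = vocab_embeds.shape[1]
    # Continuous relaxation: one free embedding per prompt token.
    soft = torch.randn(num_tokens, d, requires_grad=True)
    opt = torch.optim.Adam([soft], lr=lr)
    latents0 = vae_encode(target_image)  # clean latents of the target

    for step in range(steps):
        # Sample only late (noisy) timesteps of an assumed 1000-step
        # schedule: these carry the image semantics, so the loss there
        # favors semantically meaningful tokens.
        t = torch.randint(t_min, t_max, (1,))
        noise = torch.randn_like(latents0)
        noisy = scheduler.add_noise(latents0, noise, t)
        cond = text_encoder(soft.unsqueeze(0))          # (1, T, d_cond)
        loss = F.mse_loss(unet(noisy, t, cond), noise)  # denoising loss
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Delayed projection: only every `project_every` steps snap the
        # soft embeddings onto their nearest vocabulary entries, keeping
        # the optimization anchored to real tokens.
        if (step + 1) % project_every == 0:
            with torch.no_grad():
                nearest = torch.cdist(soft, vocab_embeds).argmin(dim=1)
                soft.copy_(vocab_embeds[nearest])

    return torch.cdist(soft.detach(), vocab_embeds).argmin(dim=1)
```

Decoding the returned ids with the model's tokenizer yields a readable prompt; restricting the sampled timesteps to the noisy end of the forward process is what biases the recovered tokens toward semantics rather than low-level detail.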
Related papers
- Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image
Captioning [36.4086473737433]
We propose a lightweight image captioning network in combination with continuous diffusion, called Prefix-diffusion.
To achieve diversity, we design an efficient method that injects prefix image embeddings into the denoising process of the diffusion model.
In order to reduce trainable parameters, we employ a pre-trained model to extract image features and further design an extra mapping network.
arXiv Detail & Related papers (2023-09-10T08:55:24Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features (a rough sketch follows this list).
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance.
Our model's adaptability allows it to be implemented with both image-fusion and latent-diffusion models.
Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Cap2Aug: Caption guided Image to Image data Augmentation [41.53127698828463]
Cap2Aug is an image-to-image diffusion model-based data augmentation strategy using image captions as text prompts.
We generate captions from the limited training images and use these captions to edit the training images with an image-to-image stable diffusion model.
This strategy generates augmented versions of images similar to the training images yet provides semantic diversity across the samples.
arXiv Detail & Related papers (2022-12-11T04:37:43Z)
- High-Fidelity Guided Image Synthesis with Latent Diffusion Models [50.39294302741698]
Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores.
arXiv Detail & Related papers (2022-11-30T15:43:20Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments conducted on the MS-COCO dataset demonstrate the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
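As a side note on the MaskDiffusion entry above, here is a hypothetical sketch of an adaptive, attention-conditioned mask inside a cross-attention layer. For brevity it conditions only on the attention maps (the paper also uses the prompt embeddings), and the max-pooled salience and softmax reweighting are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical MaskDiffusion-style adaptive mask: derive a per-token
# weight from the cross-attention maps and use it to rescale each text
# token's contribution to the image features. No parameters are
# learned; only the attention forward pass is rewritten.
import torch

def masked_cross_attention(q, k, v, temperature=1.0):
    """q: (B, Nq, d) image queries; k, v: (B, Nt, d) text tokens."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(1, 2) / d**0.5, dim=-1)  # (B, Nq, Nt)
    # Per-token salience: how strongly each text token is attended
    # anywhere in the image (max over spatial queries).
    salience = attn.amax(dim=1, keepdim=True)                     # (B, 1, Nt)
    mask = torch.softmax(salience / temperature, dim=-1)          # adaptive mask
    attn = attn * mask                                            # reweight tokens
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # renormalize
    return attn @ v                                               # (B, Nq, d)
```

Because nothing is learned, a mask of this kind could be hot-plugged into the cross-attention layers of a pre-trained diffusion model, consistent with the training-free claim in the entry above.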
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.