Prompt Mixing in Diffusion Models using the Black Scholes Algorithm
- URL: http://arxiv.org/abs/2405.13685v1
- Date: Wed, 22 May 2024 14:25:57 GMT
- Title: Prompt Mixing in Diffusion Models using the Black Scholes Algorithm
- Authors: Divya Kothandaraman, Ming Lin, Dinesh Manocha
- Abstract summary: We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts.
We leverage the connection between diffusion models and the Black-Scholes model for pricing options in Finance.
Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training.
- Score: 57.03116054807942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts using pre-trained text-to-image diffusion models. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. To do so, we leverage the connection between diffusion models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model for pricing options in Finance, and draw analogies between the variables in both contexts to derive an appropriate algorithm for prompt mixing using the Black Scholes model. Specifically, the parallels between diffusion models and the Black-Scholes model enable us to leverage properties related to the dynamics of the Markovian model derived in the Black-Scholes algorithm. Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it qualitatively and quantitatively to other prompt mixing techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection across various scenarios such as single object per text prompt, multiple objects per text prompt and objects against backgrounds. Code is available at https://github.com/divyakraman/BlackScholesDiffusion2024.
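For readers unfamiliar with the finance side of the analogy, the sketch below computes the standard Black-Scholes price of a European call option. It is the textbook formula only, not the paper's prompt-mixing algorithm: the abstract does not say how option-pricing variables map onto diffusion quantities, so no such mapping is assumed here. The function and parameter names (`norm_cdf`, `black_scholes_call`, `spot`, `strike`, `rate`, `volatility`, `maturity`) are illustrative choices, not taken from the authors' code.

```python
import math


def norm_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot: float, strike: float, rate: float,
                       volatility: float, maturity: float) -> float:
    """Textbook Black-Scholes price of a European call option.

    spot:       current price of the underlying asset
    strike:     exercise price of the option
    rate:       continuously compounded risk-free interest rate
    volatility: annualized volatility of the underlying
    maturity:   time to expiry, in years
    """
    if min(spot, strike, volatility, maturity) <= 0:
        raise ValueError("spot, strike, volatility and maturity must be positive")
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    # Expected discounted payoff under the risk-neutral measure.
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    # At-the-money call, 5% rate, 20% volatility, one year to expiry.
    print(round(black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0), 4))
```

The price rises with volatility and with time to expiry, which is the kind of forward-looking quantity the abstract alludes to when it says the algorithm "forecasts predictions" at each denoising step. The authors' actual implementation is in the repository linked in the abstract.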
Related papers
- Test-time Conditional Text-to-Image Synthesis Using Diffusion Models [15.24270990274781]
TINTIN: Test-time Conditional Text-to-Image Synthesis using Diffusion Models is a new training-free test-time only algorithm.
We demonstrate significant improvement over the current state-of-the-art, both qualitatively and quantitatively.
(2024-11-16T13:32:18Z)
- Scaling Concept With Text-Guided Diffusion Models [53.80799139331966]
Instead of replacing a concept, can we enhance or suppress the concept itself?
We introduce ScalingConcept, a simple yet effective method to scale decomposed concepts up or down in real input without introducing new elements.
More importantly, ScalingConcept enables a variety of novel zero-shot applications across image and audio domains.
(2024-10-31T17:09:55Z)
- Improving GFlowNets for Text-to-Image Diffusion Alignment [48.42367859859971]
We explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability.
Our method could effectively align large-scale text-to-image diffusion models with given reward information.
(2024-06-02T06:36:46Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
(2024-03-29T10:38:25Z)
- Isolated Diffusion: Optimizing Multi-Concept Text-to-Image Generation Training-Freely with Isolated Diffusion Guidance [19.221431052643222]
This paper presents a general approach for text-to-image diffusion models to address the mutual interference between different subjects and their attachments in complex scenes.
We propose to bind each attachment to corresponding subjects separately with split text prompts.
We then isolate and resynthesize each subject individually with corresponding text prompts to avoid mutual interference.
(2024-03-25T17:16:27Z)
- Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models [23.786473791344395]
Cross-attention layers in diffusion models tend to disproportionately focus on certain tokens during the generation process.
We introduce attention regulation, an on-the-fly optimization approach at inference time to align attention maps with the input text prompt.
Experiment results show that our method consistently outperforms other baselines.
(2024-03-11T02:18:27Z)
- Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else [75.6806649860538]
We consider a more ambitious goal: natural multi-concept generation using a pre-trained diffusion model.
We observe concept dominance and non-localized contribution that severely degrade multi-concept generation performance.
We design a minimal low-cost solution that overcomes the above issues by tweaking the text embeddings for more realistic multi-concept text-to-image generation.
(2023-10-11T12:05:44Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
(2023-09-08T15:53:37Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
(2023-08-02T23:39:29Z)
- Break-A-Scene: Extracting Multiple Concepts from a Single Image [80.47666266017207]
We introduce the task of textual scene decomposition.
We propose augmenting the input image with masks that indicate the presence of target concepts.
We then present a novel two-phase customization process.
(2023-05-25T17:59:04Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
(2023-05-23T03:59:06Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
(2023-05-18T05:41:36Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different synthesis stages.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
(2022-11-02T17:43:04Z)