Controlled and Conditional Text to Image Generation with Diffusion Prior
- URL: http://arxiv.org/abs/2302.11710v2
- Date: Tue, 1 Aug 2023 07:33:10 GMT
- Title: Controlled and Conditional Text to Image Generation with Diffusion Prior
- Authors: Pranav Aggarwal, Hareesh Ravi, Naveen Marri, Sachin Kelkar, Fengbin
Chen, Vinh Khuc, Midhun Harikumar, Ritiz Tambi, Sudharshan Reddy Kakumanu,
Purvak Lapsiya, Alvin Ghouas, Sarah Saber, Malavika Ramprasad, Baldo Faieta,
Ajinkya Kale
- Abstract summary: DALLE-2's two-step process comprises a Diffusion Prior that generates a CLIP image embedding from text and a Diffusion Decoder that generates an image from a CLIP image embedding.
We show quantitatively and qualitatively that the proposed approaches outperform prompt engineering for domain-specific generation and existing baselines for color-conditioned generation.
- Score: 1.8690858882873838
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Denoising Diffusion models have shown remarkable performance in generating
diverse, high-quality images from text. Numerous techniques have been proposed
on top of, or in alignment with, models like Stable Diffusion and Imagen that
generate images directly from text. A less explored approach is DALLE-2's
two-step process, comprising a Diffusion Prior that generates a CLIP image
embedding from text and a Diffusion Decoder that generates an image from that
CLIP image embedding. We explore the capabilities of the Diffusion Prior and the
advantages of an intermediate CLIP representation. We observe that the Diffusion
Prior can be used in a memory- and compute-efficient way to constrain generation
to a specific domain without altering the larger Diffusion Decoder. Moreover, we
show that the Diffusion Prior can be trained with additional conditional
information, such as a color histogram, to further control the generation. We
show quantitatively and qualitatively that the proposed approaches perform
better than prompt engineering for domain-specific generation and than existing
baselines for color-conditioned generation. We believe that our observations and
results will spur further research into the diffusion prior and uncover more of
its capabilities.
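To make the two-stage setup concrete, the sketch below shows (i) a training step in which the Diffusion Prior learns to recover a clean CLIP image embedding from a noised one, conditioned on the CLIP text embedding and an optional color histogram, and (ii) inference where the sampled image embedding is passed to a frozen Diffusion Decoder. This is a minimal PyTorch-style sketch under assumed interfaces; the `prior(...)`, `prior.sample(...)`, and `decoder.sample(...)` signatures and the noise-schedule handling are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All interfaces (prior, decoder, signatures, dimensions) are assumptions
# for illustration, not the paper's released code.
import torch
import torch.nn.functional as F


def prior_training_step(prior, clip_text_emb, clip_img_emb, alphas_cumprod,
                        color_hist=None):
    """One training step for the Diffusion Prior.

    The prior is trained to recover the clean CLIP image embedding from a
    noised one, conditioned on the CLIP text embedding and, optionally, an
    extra signal such as a color histogram (for color-conditioned generation).
    """
    batch = clip_img_emb.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,), device=clip_img_emb.device)
    a_bar = alphas_cumprod[t].unsqueeze(-1)                     # cumulative noise schedule
    noise = torch.randn_like(clip_img_emb)
    noisy_emb = a_bar.sqrt() * clip_img_emb + (1.0 - a_bar).sqrt() * noise
    pred_emb = prior(noisy_emb, t, clip_text_emb, color_hist)   # hypothetical signature
    return F.mse_loss(pred_emb, clip_img_emb)


@torch.no_grad()
def generate(prior, decoder, clip_text_emb, color_hist=None):
    """Two-stage inference: sample a CLIP image embedding, then decode it.

    Only the comparatively small prior is fine-tuned to constrain generation
    to a new domain or to accept the color condition; the large Diffusion
    Decoder stays frozen and unchanged.
    """
    img_emb = prior.sample(clip_text_emb, color_hist=color_hist)  # assumed sampler API
    return decoder.sample(image_embedding=img_emb)                # frozen decoder
```

Because the decoder is never touched, adapting the system to a new domain or condition only requires retraining the prior on (text embedding, image embedding) pairs, which is what makes the approach memory and compute efficient relative to fine-tuning the full text-to-image stack.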
Related papers
- Improving GFlowNets for Text-to-Image Diffusion Alignment [48.42367859859971]
We explore techniques that do not directly maximize the reward but rather generate high-reward images with relatively high probability.
Our method could effectively align large-scale text-to-image diffusion models with given reward information.
arXiv Detail & Related papers (2024-06-02T06:36:46Z)
- Improving Diffusion-Based Image Synthesis with Context Prediction [49.186366441954846]
Existing diffusion models mainly try to reconstruct the input image from a corrupted one with a pixel-wise or feature-wise constraint along spatial axes.
We propose ConPreDiff to improve diffusion-based image synthesis with context prediction.
Our ConPreDiff consistently outperforms previous methods and achieves new state-of-the-art text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
arXiv Detail & Related papers (2024-01-04T01:10:56Z)
- MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor behind the text-image mismatch issue is inadequate cross-modality relation learning.
We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features.
Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify cross-modal generative and discriminative pretraining into a single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- Reverse Stable Diffusion: What prompt was used to generate this image? [73.10116197883303]
We study the task of predicting the prompt embedding given an image generated by a generative diffusion model.
We propose a novel learning framework comprising a joint prompt regression and multi-label vocabulary classification objective.
We conduct experiments on the DiffusionDB data set, predicting text prompts from images generated by Stable Diffusion.
arXiv Detail & Related papers (2023-08-02T23:39:29Z)
- Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation [11.80682025950519]
In this work, we investigate the properties of diffusion in the physical sense within diffusion (machine learning) models.
We propose our Cyclic One-Way Diffusion (COW) method to control the direction of the diffusion phenomenon.
Our method provides a novel perspective to understand the task needs and is applicable to a wider range of customization scenarios.
arXiv Detail & Related papers (2023-06-14T05:25:06Z)
- Nested Diffusion Processes for Anytime Image Generation [38.84966342097197]
We propose an anytime diffusion-based method that can generate viable images when stopped at arbitrary times before completion.
In experiments on ImageNet and Stable Diffusion-based text-to-image generation, we show, both qualitatively and quantitatively, that our method's intermediate generation quality greatly exceeds that of the original diffusion model.
arXiv Detail & Related papers (2023-05-30T14:28:43Z)
- Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z)
- ADIR: Adaptive Diffusion for Image Reconstruction [46.838084286784195]
We propose a conditional sampling scheme that exploits the prior learned by diffusion models.
We then combine it with a novel approach for adapting pretrained diffusion denoising networks to their input.
We show that our proposed 'adaptive diffusion for image reconstruction' approach achieves significant improvements on super-resolution, deblurring, and text-based editing tasks.
arXiv Detail & Related papers (2022-12-06T18:39:58Z)
- DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models [33.79188588182528]
We present DiffusionCLIP, a novel method that performs text-driven image manipulation with diffusion models using a Contrastive Language-Image Pre-training (CLIP) loss.
Our method performs comparably to modern GAN-based image processing methods on both in-domain and out-of-domain image processing tasks.
Our method can be easily used for various novel applications, enabling image translation from an unseen domain to another unseen domain or stroke-conditioned image generation in an unseen domain.
arXiv Detail & Related papers (2021-10-06T12:59:39Z)