Simultaneous Multiple-Prompt Guided Generation Using Differentiable
Optimal Transport
- URL: http://arxiv.org/abs/2204.08472v1
- Date: Mon, 18 Apr 2022 03:46:06 GMT
- Title: Simultaneous Multiple-Prompt Guided Generation Using Differentiable
Optimal Transport
- Authors: Yingtao Tian and Marco Cuturi and David Ha
- Abstract summary: Text-to-image synthesis approaches that operate by generating images from text cues provide a case in point.
We propose using matching techniques found in the optimal transport (OT) literature, resulting in images that are able to reflect faithfully a wide diversity of prompts.
- Score: 41.265684813975625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning, such as powerful generative models and
joint text-image embeddings, have provided the computational creativity
community with new tools, opening new perspectives for artistic pursuits.
Text-to-image synthesis approaches that operate by generating images from text
cues provide a case in point. These images are generated with a latent vector
that is progressively refined to agree with text cues. To do so, patches are
sampled within the generated image, and compared with the text prompts in the
common text-image embedding space; the latent vector is then updated, using
gradient descent, to reduce the mean (average) distance between these patches
and text cues. While this approach provides artists with ample freedom to
customize the overall appearance of images, through their choice in generative
models, the reliance on a simple criterion (mean of distances) often causes
mode collapse: The entire image is drawn to the average of all text cues,
thereby losing their diversity. To address this issue, we propose using
matching techniques found in the optimal transport (OT) literature, resulting
in images that are able to reflect faithfully a wide diversity of prompts. We
provide numerous illustrations showing that OT avoids some of the pitfalls
arising from estimating vectors with mean distances, and demonstrate the
capacity of our proposed method to perform better in experiments, qualitatively
and quantitatively.
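To make the contrast concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the two guidance criteria the abstract describes: the baseline mean of patch-prompt distances, and an optimal-transport alternative computed with log-domain Sinkhorn iterations over uniform weights on patches and prompts. The function names, cosine-distance cost, uniform marginals, and regularization value are assumptions made for the example, not details taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Log-domain Sinkhorn iterations: an entropy-regularized OT plan between
    uniform weights over patches (rows) and prompts (columns). Assumed solver."""
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n), dtype=cost.dtype)  # uniform patch weights
    log_b = torch.full((m,), -math.log(m), dtype=cost.dtype)  # uniform prompt weights
    f = torch.zeros(n, dtype=cost.dtype)
    g = torch.zeros(m, dtype=cost.dtype)
    for _ in range(n_iters):
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    return torch.exp((f[:, None] + g[None, :] - cost) / eps)  # transport plan P

def cosine_cost(patch_emb, prompt_emb):
    """Pairwise cosine distances between patch and prompt embeddings."""
    return 1.0 - F.normalize(patch_emb, dim=-1) @ F.normalize(prompt_emb, dim=-1).T

def mean_distance_loss(patch_emb, prompt_emb):
    """Baseline criterion: every prompt pulls every patch equally, so the image
    tends toward the average of all cues (the mode collapse described above)."""
    return cosine_cost(patch_emb, prompt_emb).mean()

def ot_matching_loss(patch_emb, prompt_emb, eps=0.05):
    """OT-style criterion: weight each patch-prompt distance by the transport
    plan, so different prompts are matched to different parts of the image."""
    cost = cosine_cost(patch_emb, prompt_emb)
    plan = sinkhorn_plan(cost, eps=eps)  # differentiable through the iterations
    return (plan * cost).sum()

# Toy usage: 64 patch embeddings and 3 prompt embeddings in a shared 512-d space.
patches = torch.randn(64, 512, requires_grad=True)
prompts = torch.randn(3, 512)
loss = ot_matching_loss(patches, prompts)
loss.backward()  # in a real pipeline the gradient would flow back to the latent vector
```

Backpropagating through the Sinkhorn loop is one simple way to keep the plan differentiable; the paper's actual solver, cost function, and hyperparameters may differ.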
Related papers
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
Diffusion Models [46.18013380882767]
This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
arXiv Detail & Related papers (2023-12-19T18:47:30Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models [33.993466872389085]
We develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster.
Our approach can adapt the pretrained model to a user-specified image and text description on the fly in just 4 seconds.
arXiv Detail & Related papers (2023-04-10T01:21:56Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning handles text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.