Simultaneous Multiple-Prompt Guided Generation Using Differentiable
Optimal Transport
- URL: http://arxiv.org/abs/2204.08472v1
- Date: Mon, 18 Apr 2022 03:46:06 GMT
- Title: Simultaneous Multiple-Prompt Guided Generation Using Differentiable
Optimal Transport
- Authors: Yingtao Tian and Marco Cuturi and David Ha
- Abstract summary: Text-to-image synthesis approaches that operate by generating images from text cues provide a case in point.
We propose using matching techniques found in the optimal transport (OT) literature, resulting in images that are able to reflect faithfully a wide diversity of prompts.
- Score: 41.265684813975625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning, such as powerful generative models and
joint text-image embeddings, have provided the computational creativity
community with new tools, opening new perspectives for artistic pursuits.
Text-to-image synthesis approaches that operate by generating images from text
cues provide a case in point. These images are generated with a latent vector
that is progressively refined to agree with text cues. To do so, patches are
sampled within the generated image, and compared with the text prompts in the
common text-image embedding space; the latent vector is then updated, using
gradient descent, to reduce the mean (average) distance between these patches
and text cues. While this approach provides artists with ample freedom to
customize the overall appearance of images, through their choice in generative
models, the reliance on a simple criterion (mean of distances) often causes
mode collapse: The entire image is drawn to the average of all text cues,
thereby losing their diversity. To address this issue, we propose using
matching techniques found in the optimal transport (OT) literature, resulting
in images that are able to reflect faithfully a wide diversity of prompts. We
provide numerous illustrations showing that OT avoids some of the pitfalls
arising from estimating vectors with mean distances, and demonstrate the
capacity of our proposed method to perform better in experiments, qualitatively
and quantitatively.
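To make the contrast concrete, the following is a minimal, illustrative sketch (not the paper's implementation) of the two guidance criteria the abstract describes: the baseline mean of patch-prompt distances, and an optimal-transport alternative computed with log-domain Sinkhorn iterations over uniform weights on patches and prompts. The function names, cosine-distance cost, uniform marginals, and regularization value are assumptions made for the example, not details taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Log-domain Sinkhorn iterations: an entropy-regularized OT plan between
    uniform weights over patches (rows) and prompts (columns). Assumed solver."""
    n, m = cost.shape
    log_a = torch.full((n,), -math.log(n), dtype=cost.dtype)  # uniform patch weights
    log_b = torch.full((m,), -math.log(m), dtype=cost.dtype)  # uniform prompt weights
    f = torch.zeros(n, dtype=cost.dtype)
    g = torch.zeros(m, dtype=cost.dtype)
    for _ in range(n_iters):
        f = eps * (log_a - torch.logsumexp((g[None, :] - cost) / eps, dim=1))
        g = eps * (log_b - torch.logsumexp((f[:, None] - cost) / eps, dim=0))
    return torch.exp((f[:, None] + g[None, :] - cost) / eps)  # transport plan P

def cosine_cost(patch_emb, prompt_emb):
    """Pairwise cosine distances between patch and prompt embeddings."""
    return 1.0 - F.normalize(patch_emb, dim=-1) @ F.normalize(prompt_emb, dim=-1).T

def mean_distance_loss(patch_emb, prompt_emb):
    """Baseline criterion: every prompt pulls every patch equally, so the image
    tends toward the average of all cues (the mode collapse described above)."""
    return cosine_cost(patch_emb, prompt_emb).mean()

def ot_matching_loss(patch_emb, prompt_emb, eps=0.05):
    """OT-style criterion: weight each patch-prompt distance by the transport
    plan, so different prompts are matched to different parts of the image."""
    cost = cosine_cost(patch_emb, prompt_emb)
    plan = sinkhorn_plan(cost, eps=eps)  # differentiable through the iterations
    return (plan * cost).sum()

# Toy usage: 64 patch embeddings and 3 prompt embeddings in a shared 512-d space.
patches = torch.randn(64, 512, requires_grad=True)
prompts = torch.randn(3, 512)
loss = ot_matching_loss(patches, prompts)
loss.backward()  # in a real pipeline the gradient would flow back to the latent vector
```

Backpropagating through the Sinkhorn loop is one simple way to keep the plan differentiable; the paper's actual solver, cost function, and hyperparameters may differ.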
Related papers
- Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image
Diffusion Models [46.18013380882767]
This work focuses on inverting the diffusion model to obtain interpretable language prompts directly.
We leverage the finding that different timesteps of the diffusion process cater to different levels of detail in an image.
We show that our approach can identify semantically interpretable and meaningful prompts for a target image.
arXiv Detail & Related papers (2023-12-19T18:47:30Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models [33.993466872389085]
We develop a novel algorithm that learns image manipulations 4.5-10 times faster and applies them 8 times faster.
Our approach can adapt the pretrained model to a user-specified image and text description on the fly in just 4 seconds.
arXiv Detail & Related papers (2023-04-10T01:21:56Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
- Towards Open-World Text-Guided Face Image Generation and Manipulation [52.83401421019309]
We propose a unified framework for both face image generation and manipulation.
Our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing.
arXiv Detail & Related papers (2021-04-18T16:56:07Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning handles text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.