Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation
- URL: http://arxiv.org/abs/2303.09319v1
- Date: Thu, 16 Mar 2023 13:50:20 GMT
- Title: Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation
- Authors: Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, Jiaying Liu
- Abstract summary: We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) that takes texts together with images containing specified subjects as input sequences.
Specifically, both the input texts and images are encoded into one unified multi-modal latent space.
Our method generates high-quality images that reflect the complex semantics of both the input texts and images.
- Score: 63.061871048769596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language-guided image generation has achieved great success with
diffusion models. However, text alone is often too coarse to describe
highly specific subjects such as a particular dog or a certain car, which makes
pure text-to-image generation not accurate enough to satisfy user requirements.
In this work, we present a novel Unified Multi-Modal Latent Diffusion
(UMM-Diffusion) that takes texts together with images containing specified
subjects as input sequences and generates customized images of those subjects.
Specifically, both the input texts and images are encoded into one unified
multi-modal latent space, in which the input images are projected to pseudo
word embeddings that can be combined with the text to guide image generation.
In addition, to suppress irrelevant parts of the input images such as
background or illumination, we propose a novel sampling technique for the
diffusion-based image generator that fuses the results guided by the
multi-modal input with those guided by the pure text input. By leveraging a
large-scale pre-trained text-to-image generator and the designed image
encoder, our method generates high-quality images whose complex semantics
reflect both the input texts and images.
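The abstract describes two mechanisms: projecting the subject image into pseudo word embeddings that join the text tokens in a unified conditioning sequence, and a sampling step that fuses the multi-modal-guided and pure-text-guided predictions. The sketch below is a minimal, hypothetical illustration of those two ideas in a Stable-Diffusion-style pipeline; the module names, dimensions, and the fuse_weight knob are assumptions, not the authors' released code.

```python
# A minimal sketch (not the authors' code) of the two ideas described above,
# assuming a latent diffusion backbone with a token-sequence condition.
import torch
import torch.nn as nn


class ImageToPseudoWords(nn.Module):
    """Projects a global image feature (e.g. from a CLIP image encoder) into
    the text-embedding space so the subject image acts as extra "pseudo word"
    tokens in the unified multi-modal latent space (hypothetical module)."""

    def __init__(self, image_dim: int = 1024, text_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens, self.text_dim = num_tokens, text_dim
        self.proj = nn.Linear(image_dim, num_tokens * text_dim)

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # (B, image_dim) -> (B, num_tokens, text_dim)
        return self.proj(image_feat).view(-1, self.num_tokens, self.text_dim)


def unified_condition(text_emb: torch.Tensor, pseudo_words: torch.Tensor) -> torch.Tensor:
    # Concatenate ordinary text token embeddings with the pseudo word tokens
    # to form a single multi-modal conditioning sequence for the denoiser.
    return torch.cat([text_emb, pseudo_words], dim=1)


@torch.no_grad()
def fused_noise_prediction(eps_model, x_t, t, cond_mm, cond_text, cond_null,
                           guidance_scale: float = 7.5, fuse_weight: float = 0.5):
    """One step of the fused sampling idea: blend the multi-modal-guided and
    the pure-text-guided noise predictions so irrelevant image details
    (background, illumination) are suppressed, then apply standard
    classifier-free guidance. `eps_model(x_t, t, cond)` is any noise
    predictor conditioned on a token-embedding sequence."""
    eps_mm = eps_model(x_t, t, cond_mm)      # guided by text + subject image
    eps_text = eps_model(x_t, t, cond_text)  # guided by text only
    eps_null = eps_model(x_t, t, cond_null)  # unconditional

    eps_cond = fuse_weight * eps_mm + (1.0 - fuse_weight) * eps_text
    return eps_null + guidance_scale * (eps_cond - eps_null)
```

Setting fuse_weight toward 1.0 would emphasize subject fidelity, while lower values lean on the text-only branch; this trade-off is an assumption of the sketch, not a reported result.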
Related papers
- Visual Text Generation in the Wild [67.37458807253064]
We propose a visual text generator (termed SceneVTG) which can produce high-quality text images in the wild.
The proposed SceneVTG significantly outperforms traditional rendering-based methods and recent diffusion-based methods in terms of fidelity and reasonability.
The generated images provide superior utility for tasks involving text detection and text recognition.
arXiv Detail & Related papers (2024-07-19T09:08:20Z)
- UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion [36.06457895469353]
UNIMO-G is a conditional diffusion framework that operates on multimodal prompts with interleaved textual and visual inputs.
It excels in both text-to-image generation and zero-shot subject-driven synthesis.
arXiv Detail & Related papers (2024-01-24T11:36:44Z)
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a light-weight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference stage refinement process, we achieve a notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- De-Diffusion Makes Text a Strong Cross-Modal Interface [33.90004746543745]
We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
Experiments validate the precision and comprehensiveness of De-Diffusion text representing images.
A single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools.
arXiv Detail & Related papers (2023-11-01T16:12:40Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z)
- Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer [8.069590683507997]
We propose MXQ-VAE, a vector quantization method for multimodal image-text representation.
MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space.
We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
arXiv Detail & Related papers (2022-04-15T16:29:55Z)