Learning to Generate Semantic Layouts for Higher Text-Image
Correspondence in Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2308.08157v1
- Date: Wed, 16 Aug 2023 05:59:33 GMT
- Title: Learning to Generate Semantic Layouts for Higher Text-Image
Correspondence in Text-to-Image Synthesis
- Authors: Minho Park, Jooyeol Yun, Seunghwan Choi, Jaegul Choo
- Abstract summary: We propose a novel approach for enhancing text-image correspondence by leveraging available semantic layouts.
Our approach achieves higher text-image correspondence than existing text-to-image generation approaches on the Multi-Modal CelebA-HQ and Cityscapes datasets.
- Score: 37.32270579534541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing text-to-image generation approaches have set high standards for
photorealism and text-image correspondence, largely benefiting from web-scale
text-image datasets, which can include up to 5 billion pairs. However,
text-to-image generation models trained on domain-specific datasets, such as
urban scenes, medical images, and faces, still suffer from low text-image
correspondence due to the lack of text-image pairs. Additionally, collecting
billions of text-image pairs for a specific domain can be time-consuming and
costly. Thus, ensuring high text-image correspondence without relying on
web-scale text-image datasets remains a challenging task. In this paper, we
present a novel approach for enhancing text-image correspondence by leveraging
available semantic layouts. Specifically, we propose a Gaussian-categorical
diffusion process that simultaneously generates both images and corresponding
layout pairs. Our experiments reveal that we can guide text-to-image generation
models to be aware of the semantics of different image regions, by training the
model to generate semantic labels for each pixel. We demonstrate that our
approach achieves higher text-image correspondence compared to existing
text-to-image generation approaches on the Multi-Modal CelebA-HQ and
Cityscapes datasets, where text-image pairs are scarce. Code is available at
https://pmh9960.github.io/research/GCDP
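The core idea is a Gaussian-categorical diffusion process that corrupts the image (continuous values, Gaussian noise) and its per-pixel semantic layout (discrete labels, categorical noise) under a shared noise schedule, so the model learns to denoise both jointly. Below is a minimal PyTorch sketch of one such joint forward step; the function name, tensor shapes, and the uniform categorical corruption follow standard Gaussian and multinomial diffusion formulations and are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch (assumed, not the authors' code): one forward-diffusion step
# that jointly corrupts an image with Gaussian noise and its per-pixel
# semantic layout with uniform categorical noise, sharing one schedule.
import torch
import torch.nn.functional as F

def joint_forward_diffuse(image, layout, alpha_bar_t, num_classes):
    """image: (B, 3, H, W) in [-1, 1]; layout: (B, H, W) integer labels;
    alpha_bar_t: scalar tensor in (0, 1), the cumulative noise schedule."""
    # Gaussian branch: q(x_t | x_0) = N(sqrt(a_bar) * x_0, (1 - a_bar) * I)
    eps = torch.randn_like(image)
    noisy_image = alpha_bar_t.sqrt() * image + (1 - alpha_bar_t).sqrt() * eps

    # Categorical branch: q(y_t | y_0) = Cat(a_bar * onehot(y_0) + (1 - a_bar) / K)
    onehot = F.one_hot(layout, num_classes).float()                       # (B, H, W, K)
    probs = alpha_bar_t * onehot + (1 - alpha_bar_t) / num_classes
    noisy_layout = torch.distributions.Categorical(probs=probs).sample()  # (B, H, W)
    return noisy_image, noisy_layout, eps
```
A denoising network would then be trained on (noisy_image, noisy_layout, t) to recover the clean image together with the per-pixel semantic labels, which is how generating a semantic label for each pixel guides the model toward region-aware text-image correspondence.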
Related papers
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle
Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, an innovative training paradigm grounded in cycle consistency that enables vision-language training on unpaired image and text data.
ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework.
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- Zero-shot spatial layout conditioning for text-to-image diffusion models [52.24744018240424]
Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modelling.
We consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content.
We propose ZestGuide, a zero-shot segmentation guidance approach that can be plugged into pre-trained text-to-image diffusion models.
arXiv Detail & Related papers (2023-06-23T19:24:48Z)
- TextDiffuser: Diffusion Models as Text Painters [118.30923824681642]
We introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds.
We contribute MARIO-10M, the first large-scale text image dataset with OCR annotations, containing 10 million image-text pairs.
We show that TextDiffuser is flexible and controllable, creating high-quality text images from text prompts alone or together with text template images, and performing text inpainting to reconstruct incomplete images containing text.
arXiv Detail & Related papers (2023-05-18T10:16:19Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also achieves significant improvements over recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images whose complex semantics are drawn from both the input texts and the input images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer [8.069590683507997]
We propose MXQ-VAE, a vector quantization method for multimodal image-text representation.
MXQ-VAE accepts a paired image and text as input, and learns a joint quantized representation space.
We can use autoregressive generative models to model the joint image-text representation, and even perform unconditional image-text pair generation.
arXiv Detail & Related papers (2022-04-15T16:29:55Z)
- TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
The StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
The visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z)