Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding
- URL: http://arxiv.org/abs/2205.11487v1
- Date: Mon, 23 May 2022 17:42:53 GMT
- Title: Photorealistic Text-to-Image Diffusion Models with Deep Language
Understanding
- Authors: Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang,
Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara
Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet,
Mohammad Norouzi
- Abstract summary: Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
- Score: 53.170767750244366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Imagen, a text-to-image diffusion model with an unprecedented
degree of photorealism and a deep level of language understanding. Imagen
builds on the power of large transformer language models in understanding text
and hinges on the strength of diffusion models in high-fidelity image
generation. Our key discovery is that generic large language models (e.g. T5),
pretrained on text-only corpora, are surprisingly effective at encoding text
for image synthesis: increasing the size of the language model in Imagen boosts
both sample fidelity and image-text alignment much more than increasing the
size of the image diffusion model. Imagen achieves a new state-of-the-art FID
score of 7.27 on the COCO dataset, without ever training on COCO, and human
raters find Imagen samples to be on par with the COCO data itself in image-text
alignment. To assess text-to-image models in greater depth, we introduce
DrawBench, a comprehensive and challenging benchmark for text-to-image models.
With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP,
Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen
over other models in side-by-side comparisons, both in terms of sample quality
and image-text alignment. See https://imagen.research.google/ for an overview
of the results.
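The key discovery above, that a frozen text-only language model is an effective conditioning signal, is easy to sketch in outline. The following is a minimal illustration, not the authors' code: it assumes the HuggingFace transformers library and PyTorch, uses t5-large as a small stand-in for the T5-XXL encoder the paper favors, and the encode_prompt helper is hypothetical.

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Frozen, text-only-pretrained encoder; the paper finds that scaling this
# component improves fidelity and alignment more than scaling the U-Net.
# "t5-large" is an illustrative stand-in for the paper's T5-XXL.
tokenizer = T5Tokenizer.from_pretrained("t5-large")
encoder = T5EncoderModel.from_pretrained("t5-large").eval()

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Map a prompt to a sequence of embeddings that a diffusion U-Net
    can attend to via cross-attention (helper name is hypothetical)."""
    tokens = tokenizer(prompt, return_tensors="pt", truncation=True,
                       padding="max_length", max_length=128)
    out = encoder(input_ids=tokens.input_ids,
                  attention_mask=tokens.attention_mask)
    return out.last_hidden_state  # shape: (1, seq_len, d_model)

emb = encode_prompt("A corgi wearing sunglasses, studio photo")
# In Imagen these embeddings condition a cascade: a 64x64 base diffusion
# model followed by 256x256 and 1024x1024 super-resolution models.
```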
Related papers
- UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models [25.219960711604728]
This paper proposes a novel approach for text image generation, utilizing a pre-trained diffusion model.
Our approach involves the design and training of a lightweight character-level text encoder, which replaces the original CLIP encoder.
By employing an inference-stage refinement process, we achieve notably high sequence accuracy when synthesizing text in arbitrarily given images.
arXiv Detail & Related papers (2023-12-08T07:47:46Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model [67.9265336953134]
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It explores transferring the extensive semantic comprehension capabilities of large language models to image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT: an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders, enabling bidirectional image-to-text and text-to-image generation in a single framework (a schematic cycle-consistency sketch follows this list).
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- Shifted Diffusion for Text-to-image Generation [65.53758187995744]
Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
arXiv Detail & Related papers (2022-11-24T03:25:04Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost (see the routing sketch after this list).
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [25.14323931233249]
We propose a text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout.
In the proposed model, the feature vectors of entities and relationships are extracted and involved in the diffusion model.
We also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from the CNN convolution operations.
arXiv Detail & Related papers (2022-10-18T02:50:34Z)
- GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models [16.786221846896108]
We explore diffusion models for the problem of text-conditional image synthesis and compare two guidance strategies: CLIP guidance and classifier-free guidance.
We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples (a classifier-free guidance sketch follows this list).
Our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing.
arXiv Detail & Related papers (2021-12-20T18:42:55Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We propose the first method for training text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
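For the ITIT entry above, the cycle-consistency idea can be sketched schematically. Everything here is an assumption for illustration: t2i and i2t stand in for the two generation directions, and a simple MSE round-trip penalty stands in for whatever reconstruction losses the paper actually uses.

```python
import torch.nn.functional as F

def cycle_losses(t2i, i2t, text_emb, image):
    """Round-trip penalties that let unpaired data supervise both directions.
    t2i: text embedding -> image; i2t: image -> text embedding (stand-ins)."""
    # Unpaired text: text -> generated image -> reconstructed text embedding.
    text_cycle = F.mse_loss(i2t(t2i(text_emb)), text_emb)
    # Unpaired image: image -> generated text embedding -> reconstructed image.
    image_cycle = F.mse_loss(t2i(i2t(image)), image)
    return text_cycle + image_cycle
```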
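The eDiffi entry's ensemble of expert denoisers likewise reduces to a routing rule: each sampling step invokes only the expert trained for that noise interval, so per-step cost matches a single model. The two-expert split below is an invented example, not the paper's actual partition.

```python
def expert_denoise(x_t, t, text_emb, experts):
    """Route a denoising step to the expert covering timestep t.
    `experts` maps half-open timestep intervals to specialized denoisers."""
    for (t_low, t_high), denoiser in experts.items():
        if t_low <= t < t_high:
            return denoiser(x_t, t, text_emb)
    raise ValueError(f"no expert covers timestep t={t}")

# Hypothetical split: one model for high-noise (early) steps, one for
# low-noise (late) steps, where fine text alignment matters most.
# experts = {(500, 1000): high_noise_model, (0, 500): low_noise_model}
```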
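Finally, the classifier-free guidance preferred in the GLIDE comparison (and also used by Imagen, with thresholding tricks at large guidance weights) has a compact form. A minimal sketch, assuming an epsilon-prediction denoiser and a precomputed null-prompt embedding; the default weight is illustrative.

```python
def guided_eps(model, x_t, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by guidance weight w
    (w=0 is unconditional, w=1 is plain conditional, w>1 amplifies text)."""
    eps_uncond = model(x_t, t, null_emb)  # prediction for an empty prompt
    eps_cond = model(x_t, t, text_emb)    # prediction for the real prompt
    return eps_uncond + w * (eps_cond - eps_uncond)
```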