Shifted Diffusion for Text-to-image Generation
- URL: http://arxiv.org/abs/2211.15388v2
- Date: Thu, 23 Mar 2023 22:21:20 GMT
- Title: Shifted Diffusion for Text-to-image Generation
- Authors: Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, Jinhui Xu
- Abstract summary: Corgi is based on our proposed shifted diffusion model, which generates better image embeddings from input text.
Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks.
- Score: 65.53758187995744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Corgi, a novel method for text-to-image generation. Corgi is based
on our proposed shifted diffusion model, which generates better image embeddings
from input text. Unlike the baseline diffusion model used in DALL-E 2, our method
seamlessly encodes prior knowledge of the pre-trained CLIP model into its diffusion
process by designing a new initialization distribution and a new transition step.
Compared to the strong DALL-E 2 baseline, our method generates image embeddings
from text more efficiently and effectively, resulting in better text-to-image
generation. Extensive large-scale experiments, evaluated with both quantitative
metrics and human judgment, indicate a stronger generation ability of our method
compared to existing ones. Furthermore, our model enables semi-supervised and
language-free training for text-to-image generation, where only part or none of
the images in the training dataset have an associated caption. Trained with only
1.7% of the images captioned, our semi-supervised model obtains FID results
comparable to DALL-E 2 on zero-shot text-to-image generation evaluated on MS-COCO.
Corgi also achieves new state-of-the-art results across different datasets on
downstream language-free text-to-image generation tasks, outperforming the previous
method, Lafite, by a large margin.
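To make the "shifted" idea concrete, here is a minimal NumPy sketch, not the paper's implementation: it assumes the shift is a fixed vector `mu` (say, an empirical mean of CLIP image embeddings), whereas Corgi designs and learns its shift; all names are illustrative.

```python
import numpy as np

def shifted_forward_step(x_prev, beta_t, mu):
    """One forward (noising) step of a shifted diffusion process.

    Standard DDPM uses q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1},
    beta_t * I), which drives x_T toward N(0, I). Shifting each transition
    mean toward a vector mu (here: a stand-in for CLIP image-embedding
    statistics) instead drives x_T toward a Gaussian centered at mu, so
    reverse sampling can start near the CLIP embedding manifold rather
    than at an uninformative standard normal.
    """
    a = np.sqrt(1.0 - beta_t)
    mean = a * x_prev + (1.0 - a) * mu   # transition mean shifted toward mu
    return mean + np.sqrt(beta_t) * np.random.randn(*x_prev.shape)

def shifted_init_sample(mu, dim):
    """New initialization distribution: N(mu, I) replacing N(0, I)."""
    return mu + np.random.randn(dim)

# Toy check: after many steps the forward process concentrates around mu.
mu = 0.05 * np.ones(512)                 # hypothetical CLIP embedding mean
x = mu + 0.3 * np.random.randn(512)      # a "clean" image embedding x_0
for beta_t in np.linspace(1e-4, 0.02, 1000):
    x = shifted_forward_step(x, beta_t, mu)
```

Since the transition mean is a convex combination of x_{t-1} and mu, its fixed point is mu; this is why the terminal distribution shifts while the variance schedule behaves as in standard DDPM.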
Related papers
- Refining Text-to-Image Generation: Towards Accurate Training-Free Glyph-Enhanced Image Generation [5.55027585813848]
The capability to generate visual text is crucial, offering both academic interest and a wide range of practical applications.
We introduce a benchmark, LenCom-Eval, specifically designed for testing models' capability in generating images with Lengthy and Complex visual text.
We demonstrate notable improvements across a range of evaluation metrics, including CLIPScore, OCR precision, recall, F1 score, accuracy, and edit distance scores.
arXiv Detail & Related papers (2024-03-25T04:54:49Z)
- From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models [38.14123683674355]
We propose a method that utilizes the attention mechanism in the denoising network of text-to-image diffusion models to localize entities mentioned in the text.
We evaluate our method on Pascal VOC 2012 and Microsoft COCO 2014 under weakly-supervised semantic segmentation setting.
Our work reveals a novel way to extract the rich multi-modal knowledge hidden in diffusion models for segmentation.
arXiv Detail & Related papers (2023-09-08T04:10:01Z)
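As a rough sketch of how such cross-attention could be turned into localization masks (the tensor layout, aggregation, and threshold below are assumptions for illustration, not the paper's exact pipeline):

```python
import numpy as np

def entity_mask_from_attention(attn_maps, token_index, threshold=0.5):
    """Aggregate cross-attention into a rough mask for one prompt token.

    attn_maps: list of arrays of shape (H, W, num_tokens), one per
        (layer, timestep) pair collected while the diffusion model
        denoises an image conditioned on the prompt.
    token_index: position of the entity word (e.g., "dog") in the prompt.

    Averages the attention paid to that token over layers and timesteps,
    min-max normalizes, and thresholds; weakly-supervised segmentation
    can then refine this coarse localization signal.
    """
    token_attn = np.stack([m[..., token_index] for m in attn_maps], axis=0)
    avg = token_attn.mean(axis=0)                              # (H, W)
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)
    return avg > threshold
```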
- Variational Distribution Learning for Unsupervised Text-to-Image Generation [42.3246826401366]
We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training.
We employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space.
We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings.
arXiv Detail & Related papers (2023-03-28T16:18:56Z)
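A toy PyTorch sketch of the variational idea, under loose assumptions (the architecture, dimensions, and training wiring below are illustrative, not the paper's): treat the missing caption's embedding as a latent variable and let the CLIP image embedding parameterize a Gaussian over it.

```python
import torch
import torch.nn as nn

class TextEmbeddingPosterior(nn.Module):
    """Hypothetical q(text_emb | image_emb) as a diagonal Gaussian.

    With no captions available, the text embedding is latent. Because CLIP
    aligns images and texts in one joint space, the image embedding is a
    strong cue: sample a plausible text embedding via the
    reparameterization trick, condition the generator on it, and train by
    maximizing the data log-likelihood (plus the usual variational terms).
    """
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.mean = nn.Linear(hidden, dim)
        self.logvar = nn.Linear(hidden, dim)

    def forward(self, img_emb):
        h = self.backbone(img_emb)
        mu, logvar = self.mean(h), self.logvar(h)
        sample = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return sample, mu, logvar
```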
- Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained Diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models, each specialized for a different stage of the synthesis process.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
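The routing idea can be sketched in a few lines; the stage boundaries, expert names, and three-way split below are invented for illustration, not eDiffi's actual schedule:

```python
def expert_denoise(x_t, t, text_emb, experts, boundaries=(250, 750)):
    """Dispatch one denoising step to a timestep-specialized expert.

    experts: dict of denoisers sharing one architecture but fine-tuned on
    disjoint timestep ranges. Exactly one expert runs per step, so the
    per-step inference cost matches a single model, while each expert can
    focus on what its noise regime demands (early steps: global layout
    and text alignment; late steps: fine visual detail).
    """
    if t >= boundaries[1]:
        model = experts["high_noise"]
    elif t >= boundaries[0]:
        model = experts["mid_noise"]
    else:
        model = experts["low_noise"]
    return model(x_t, t, text_emb)
```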
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training text-to-image generation models on image-only datasets.
It uses a retrieval-then-optimization procedure to synthesize pseudo text features.
This is beneficial in a wide range of settings, including few-shot, semi-supervised, and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
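A hedged NumPy sketch of what retrieval-then-optimization could look like; the retrieval bank, update rule, and hyperparameters are assumptions, not Lafite2's procedure:

```python
import numpy as np

def pseudo_text_feature(img_emb, text_bank, k=8, steps=10, lr=0.1):
    """Synthesize a pseudo text feature for an uncaptioned image.

    1) Retrieve: pick the k pre-computed CLIP text features in text_bank
       (rows L2-normalized) most similar to the image embedding.
    2) Optimize: from their average, iteratively pull the feature toward
       the image embedding while keeping it on the unit sphere, yielding
       a caption-like feature usable as a training condition.
    """
    sims = text_bank @ img_emb                     # cosine similarities
    feat = text_bank[np.argsort(-sims)[:k]].mean(axis=0)
    feat /= np.linalg.norm(feat)
    for _ in range(steps):
        feat = feat + lr * (img_emb - feat)        # move toward the image
        feat /= np.linalg.norm(feat)               # re-project to sphere
    return feat
```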
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work that trains text-to-image generation models without any text data.
Our method leverages the well-aligned multi-modal semantic space of the powerful pre-trained CLIP model.
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
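The language-free trick can be sketched briefly; the perturbation below follows the commonly described recipe of noising the CLIP image embedding, with the noise level as an assumed hyperparameter:

```python
import numpy as np

def pseudo_text_from_image(img_emb, noise_level=0.1):
    """Fabricate a pseudo text feature from a CLIP image embedding.

    CLIP's joint space aligns image and caption embeddings, so a
    perturbed, re-normalized image embedding can stand in for the missing
    caption's embedding during training. noise_level trades fidelity to
    the image against diversity of the pseudo "captions".
    """
    eps = np.random.randn(*img_emb.shape)
    eps /= np.linalg.norm(eps)
    feat = img_emb + noise_level * np.linalg.norm(img_emb) * eps
    return feat / np.linalg.norm(feat)
```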