Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for
Text-to-Image Generation
- URL: http://arxiv.org/abs/2210.09549v1
- Date: Tue, 18 Oct 2022 02:50:34 GMT
- Title: Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for
Text-to-Image Generation
- Authors: Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang and Quan Bai
- Abstract summary: We propose a text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout.
In the proposed model, the feature vectors of entities and relationships are extracted and incorporated into the diffusion model.
We also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which can address the problems stemming from CNN convolution operations.
- Score: 25.14323931233249
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, diffusion models have been shown to perform remarkably well in
text-to-image synthesis tasks in a number of studies, opening up new research
opportunities for image generation. Google's Imagen follows this research trend
and outperforms DALL-E 2 as the best model for text-to-image generation.
However, Imagen merely uses a T5 language model for text processing, which
cannot guarantee that the semantic information of the text is learned.
Furthermore, the Efficient UNet leveraged by Imagen is not the best choice for
image processing. To address these issues, we propose Swinv2-Imagen, a novel
text-to-image diffusion model based on a Hierarchical Visual Transformer and a
Scene Graph incorporating a semantic layout. In the proposed model, the feature
vectors of entities and relationships are extracted and incorporated into the
diffusion model, effectively improving the quality of generated images. On top
of that, we also introduce a Swin-Transformer-based UNet architecture, called
Swinv2-Unet, which can address the problems stemming from CNN convolution
operations. Extensive experiments are conducted to evaluate the performance of
the proposed model on three real-world datasets, i.e., MSCOCO, CUB and
MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen
model outperforms several popular state-of-the-art methods.
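As a concrete illustration of the architecture described above, here is a minimal PyTorch sketch, not the authors' implementation, of a Swin-style UNet block that applies windowed self-attention over image tokens and then cross-attends to a conditioning sequence built from text embeddings and scene-graph entity/relation embeddings. All module names, dimensions and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SwinStyleConditionedBlock(nn.Module):
    """Hypothetical UNet block: windowed self-attention plus cross-attention
    to a fused (text + scene-graph) conditioning sequence."""

    def __init__(self, dim=128, window=8, heads=4, cond_dim=512):
        super().__init__()
        self.window = window
        self.norm1 = nn.LayerNorm(dim)
        # Swin-style local attention: image tokens attend within non-overlapping windows.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Cross-attention from image tokens to the conditioning sequence
        # (text token embeddings concatenated with entity/relation embeddings).
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                                vdim=cond_dim, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        # x:    (B, H, W, C) image feature map at one UNet resolution
        # cond: (B, L, cond_dim) text + scene-graph conditioning embeddings
        B, H, W, C = x.shape
        w = self.window
        # Partition into w x w windows and self-attend within each window.
        xw = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)
        h = self.norm1(xw)
        xw = xw + self.self_attn(h, h, h)[0]
        # Undo the window partition and flatten image tokens.
        xw = xw.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        tokens = xw.reshape(B, H * W, C)
        # Cross-attend to the conditioning sequence, then apply the MLP.
        tokens = tokens + self.cross_attn(self.norm2(tokens), cond, cond)[0]
        tokens = tokens + self.mlp(self.norm3(tokens))
        return tokens.view(B, H, W, C)

# Usage with hypothetical shapes: 77 text token embeddings plus 10 scene-graph
# entity/relation embeddings, concatenated along the sequence dimension.
block = SwinStyleConditionedBlock()
x = torch.randn(2, 32, 32, 128)
text_emb = torch.randn(2, 77, 512)
graph_emb = torch.randn(2, 10, 512)
out = block(x, torch.cat([text_emb, graph_emb], dim=1))
print(out.shape)  # torch.Size([2, 32, 32, 128])
```

The design choice sketched here is that scene-graph features are simply concatenated with the text tokens along the sequence dimension, so a single cross-attention layer can attend to both sources of conditioning.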
Related papers
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models [53.17454737232668]
We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts.
These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions.
We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D.
arXiv Detail & Related papers (2023-12-21T12:11:00Z)
- DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z)
- LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation [24.694298869398033]
Our method trains efficiently and generates images with both high perceptual quality and layout alignment.
It significantly outperforms 10 other generative models based on GANs, VQ-VAE, and diffusion models.
arXiv Detail & Related papers (2023-02-16T14:20:25Z)
- Lafite2: Few-shot Text-to-Image Generation [132.14211027057766]
We propose a novel method for pre-training a text-to-image generation model on image-only datasets.
It considers a retrieval-then-optimization procedure to synthesize pseudo text features.
It can be beneficial to a wide range of settings, including few-shot, semi-supervised and fully-supervised learning.
arXiv Detail & Related papers (2022-10-25T16:22:23Z)
- Implementing and Experimenting with Diffusion Models for Text-to-Image Generation [0.0]
Two models, DALL-E 2 and Imagen, have demonstrated that highly photorealistic images could be generated from a simple textual description of an image.
Text-to-image models require exceptionally large amounts of computational resources to train, as well as the handling of huge datasets collected from the internet.
This thesis contributes by reviewing the different approaches and techniques used by these models, and then by proposing our own implementation of a text-to-image model.
arXiv Detail & Related papers (2022-09-22T12:03:33Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)