DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2008.05865v4
- Date: Sat, 15 Oct 2022 03:51:50 GMT
- Title: DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis
- Authors: Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, Changsheng
Xu
- Abstract summary: We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler but more efficient to synthesize realistic and text-matching images.
- Score: 80.54273334640285
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthesizing high-quality realistic images from text descriptions is a
challenging task. Existing text-to-image Generative Adversarial Networks
generally employ a stacked architecture as the backbone yet still remain three
flaws. First, the stacked architecture introduces the entanglements between
generators of different image scales. Second, existing studies prefer to apply
and fix extra networks in adversarial learning for text-image semantic
consistency, which limits the supervision capability of these networks. Third,
the cross-modal attention-based text-image fusion that widely adopted by
previous works is limited on several special image scales because of the
computational cost. To these ends, we propose a simpler but more effective Deep
Fusion Generative Adversarial Networks (DF-GAN). To be specific, we propose:
(i) a novel one-stage text-to-image backbone that directly synthesizes
high-resolution images without entanglements between different generators, (ii)
a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty
and One-Way Output, which enhances the text-image semantic consistency without
introducing extra networks, (iii) a novel deep text-image fusion block, which
deepens the fusion process to make a full fusion between text and visual
features. Compared with current state-of-the-art methods, our proposed DF-GAN
is simpler but more efficient to synthesize realistic and text-matching images
and achieves better performance on widely used datasets.
Related papers
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text
Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - Fine-grained Cross-modal Fusion based Refinement for Text-to-Image
Synthesis [12.954663420736782]
We propose a novel Fine-grained text-image Fusion based Generative Adversarial Networks, dubbed FF-GAN.
The FF-GAN consists of two modules: Fine-grained text-image Fusion Block (FF-Block) and Global Semantic Refinement (GSR)
arXiv Detail & Related papers (2023-02-17T05:44:05Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - Towards Better Text-Image Consistency in Text-to-Image Generation [15.735515302139335]
We develop a novel CLIP-based metric termed as Semantic Similarity Distance (SSD)
We further design the Parallel Deep Fusion Generative Adversarial Networks (PDF-GAN), which can fuse semantic information at different granularities.
Our PDF-GAN can lead to significantly better text-image consistency while maintaining decent image quality on the CUB and COCO datasets.
arXiv Detail & Related papers (2022-10-27T07:47:47Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z) - Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using StackGAN-based autoencoder model and also dense vector representations on sentence-level utilizing LSTM based text-autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z) - Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space.
instance-level optimization is for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - Scene Text Synthesis for Efficient and Effective Deep Network Training [62.631176120557136]
We develop an innovative image synthesis technique that composes annotated training images by embedding foreground objects of interest into background images.
The proposed technique consists of two key components that in principle boost the usefulness of the synthesized images in deep network training.
Experiments over a number of public datasets demonstrate the effectiveness of our proposed image synthesis technique.
arXiv Detail & Related papers (2019-01-26T10:15:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.