DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for
Text-to-Image Generation
- URL: http://arxiv.org/abs/2209.01339v1
- Date: Sat, 3 Sep 2022 06:13:26 GMT
- Title: DSE-GAN: Dynamic Semantic Evolution Generative Adversarial Network for
Text-to-Image Generation
- Authors: Mengqi Huang, Zhendong Mao, Penghui Wang, Quan Wang, Yongdong Zhang
- Abstract summary: We propose a novel Dynamic Semantic Evolution GAN (DSE-GAN) to re-compose each stage's text features under a novel single adversarial multi-stage architecture.
DSE-GAN achieves 7.48% and 37.8% relative FID improvements on two widely used benchmarks.
- Score: 71.87682778102236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image generation aims at generating realistic images which are
semantically consistent with the given text. Previous works mainly adopt a
multi-stage architecture that stacks generator-discriminator pairs to engage in
multiple rounds of adversarial training, where the text semantics used to
provide generation guidance remain static across all stages. This work argues
that the text features at each stage should be adaptively re-composed
conditioned on the status of the historical stage (i.e., the historical stage's
text and image features) to provide diversified and accurate semantic guidance
during the coarse-to-fine generation process. We thereby propose a novel
Dynamic Semantic Evolution GAN (DSE-GAN) that re-composes each stage's text
features under a novel single adversarial multi-stage architecture.
Specifically, we design (1) a Dynamic Semantic Evolution (DSE) module, which
first aggregates historical image features to summarize the generative
feedback, then dynamically selects the words to be re-composed at each stage
and re-composes them by enhancing or suppressing the semantics of
different-granularity subspaces; and (2) a Single Adversarial Multi-stage
Architecture (SAMA), which extends the previous structure by eliminating the
complicated requirement of multiple adversarial training rounds, thereby
allowing more stages of text-image interaction and ultimately facilitating the
DSE module. Comprehensive experiments show that DSE-GAN achieves 7.48% and
37.8% relative FID improvements on the two widely used benchmarks CUB-200 and
MSCOCO, respectively.
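To make the DSE idea concrete, here is a minimal PyTorch sketch of word-level re-composition conditioned on generative feedback: historical image features are pooled into a feedback vector, a gate selects which words to update, and the selected words are rescaled per channel group ("subspace"). This is a schematic reading of the abstract, not the paper's actual module; all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class WordRecomposition(nn.Module):
    """Illustrative sketch of dynamic word re-composition (not the paper's
    exact DSE module): historical image features are pooled into a feedback
    vector, a per-word gate decides which words to update, and selected words
    are re-composed by scaling channel-group ("subspace") semantics."""

    def __init__(self, dim: int, n_subspaces: int = 4):
        super().__init__()
        assert dim % n_subspaces == 0
        self.n_subspaces = n_subspaces
        self.word_gate = nn.Linear(2 * dim, 1)                  # which words to re-compose
        self.subspace_scale = nn.Linear(2 * dim, n_subspaces)   # enhance/suppress per subspace

    def forward(self, words: torch.Tensor, img_feats: torch.Tensor) -> torch.Tensor:
        # words: (B, T, D) word features; img_feats: (B, N, D) historical image features
        feedback = img_feats.mean(dim=1, keepdim=True)          # (B, 1, D) generative feedback
        ctx = torch.cat([words, feedback.expand_as(words)], dim=-1)  # (B, T, 2D)
        gate = torch.sigmoid(self.word_gate(ctx))               # (B, T, 1) word-selection gate
        scale = torch.sigmoid(self.subspace_scale(ctx)) * 2.0   # (B, T, S) in (0, 2)
        B, T, D = words.shape
        sub = words.view(B, T, self.n_subspaces, -1)            # split channels into subspaces
        recomposed = (sub * scale.unsqueeze(-1)).view(B, T, D)
        return gate * recomposed + (1.0 - gate) * words         # update only selected words

words = torch.randn(2, 18, 256)      # 18 words, 256-d features
img_feats = torch.randn(2, 64, 256)  # e.g. an 8x8 feature map from the previous stage
print(WordRecomposition(256)(words, img_feats).shape)  # torch.Size([2, 18, 256])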
Related papers
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works suffer from semantic misalignment because of their fixed architectures and the diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z)
- ERNIE-ViLG: Unified Generative Pre-training for Bidirectional Vision-Language Generation [22.47279425592133]
We propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation.
For the text-to-image generation process, we propose an end-to-end training method to jointly learn the visual sequence generator and the image reconstructor.
We train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million Chinese image-text pairs.
arXiv Detail & Related papers (2021-12-31T03:53:33Z)
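The ERNIE-ViLG entry above frames image-text generation bidirectionally. One common way to realize this, and an assumption here since the summary gives no architectural detail, is to quantize images into discrete tokens so that a single autoregressive transformer covers both directions by swapping the sequence order. The sketch below illustrates that general pattern, not ERNIE-ViLG's actual architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of unified bidirectional image-text generation: images are
# represented as discrete tokens so one autoregressive transformer can learn
# both text->image and image->text by swapping the sequence order.
VOCAB_TXT, VOCAB_IMG, DIM = 1000, 512, 256

class UnifiedGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # shared vocabulary: text token ids first, image token ids after
        self.embed = nn.Embedding(VOCAB_TXT + VOCAB_IMG, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB_TXT + VOCAB_IMG)

    def forward(self, tokens):  # tokens: (B, L) mixed text/image token ids
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), mask=causal)
        return self.head(h)     # next-token logits over the joint vocabulary

model = UnifiedGenerator()
text = torch.randint(0, VOCAB_TXT, (2, 16))
image = torch.randint(VOCAB_TXT, VOCAB_TXT + VOCAB_IMG, (2, 64))
t2i = torch.cat([text, image], dim=1)   # text -> image direction
i2t = torch.cat([image, text], dim=1)   # image -> text direction
loss = sum(
    nn.functional.cross_entropy(
        model(seq[:, :-1]).reshape(-1, VOCAB_TXT + VOCAB_IMG),
        seq[:, 1:].reshape(-1),
    )
    for seq in (t2i, i2t)
)
```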
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
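The CRIS entry above mentions contrastive learning for text-to-pixel alignment. Below is a minimal sketch of one plausible form of such a loss, assuming a sentence embedding is scored against per-pixel features with mask pixels as positives; this is a simplification, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_loss(pixel_feats, text_feat, mask):
    """pixel_feats: (B, D, H, W), text_feat: (B, D), mask: (B, H, W) in {0, 1}."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    text_feat = F.normalize(text_feat, dim=1)
    # similarity of every pixel embedding to the sentence embedding
    sim = torch.einsum("bdhw,bd->bhw", pixel_feats, text_feat)
    # pixels inside the referred region are positives, the rest negatives;
    # 0.07 is a typical temperature choice, not taken from the paper
    return F.binary_cross_entropy_with_logits(sim / 0.07, mask.float())

pixel_feats = torch.randn(2, 256, 28, 28)
text_feat = torch.randn(2, 256)
mask = (torch.rand(2, 28, 28) > 0.5).long()
print(text_to_pixel_loss(pixel_feats, text_feat, mask))
```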
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
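The Cycle-Consistent Inverse GAN entry above describes optimizing inverted latent codes toward desired semantic attributes. Here is a hedged sketch of that kind of text-guided latent optimization loop; `generator` and `similarity` are stand-ins for a pretrained GAN and a text-image scoring model (e.g. CLIP), not the paper's actual components.

```python
import torch

def edit_latent(generator, similarity, z_inverted, text_emb, steps=200, lam=0.1):
    """Optimize an inverted latent code so the generated image matches the text."""
    z = z_inverted.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=0.01)
    for _ in range(steps):
        img = generator(z)
        # maximize text-image similarity while staying near the inverted code
        loss = -similarity(img, text_emb) + lam * (z - z_inverted).pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()

# toy stand-ins so the sketch runs end to end
g = torch.nn.Linear(64, 3 * 8 * 8)                      # "generator"
sim = lambda img, t: (img.mean(dim=1) * t).sum()        # "text-image score"
z0 = torch.randn(1, 64)                                 # "inverted latent code"
z_edited = edit_latent(g, sim, z0, torch.randn(1))
```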
- Improving Generation and Evaluation of Visual Stories via Semantic Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z)
- Text to Image Generation with Semantic-Spatial Aware GAN [41.73685713621705]
A text-to-image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions.
We propose a novel framework, the Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information.
arXiv Detail & Related papers (2021-04-01T15:48:01Z)
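The Semantic-Spatial Aware GAN entry above gives no mechanism details; one plausible reading of "semantic-spatial aware" conditioning (an assumption on my part, with illustrative names throughout) is a predicted spatial mask that gates where text-conditioned affine modulation applies, as sketched below.

```python
import torch
import torch.nn as nn

class SpatialTextModulation(nn.Module):
    """Schematic sketch, not the paper's exact block: a predicted spatial mask
    decides where text-conditioned affine parameters modulate image features."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.to_gamma = nn.Linear(text_dim, channels)
        self.to_beta = nn.Linear(text_dim, channels)

    def forward(self, x, text):  # x: (B, C, H, W), text: (B, T)
        mask = torch.sigmoid(self.to_mask(x))           # (B, 1, H, W): where to apply text
        gamma = self.to_gamma(text)[:, :, None, None]   # (B, C, 1, 1)
        beta = self.to_beta(text)[:, :, None, None]
        modulated = x * (1 + gamma) + beta              # text-conditioned affine transform
        return mask * modulated + (1 - mask) * x        # apply only in masked regions

x = torch.randn(2, 64, 16, 16)
text = torch.randn(2, 256)
print(SpatialTextModulation(64, 256)(x, text).shape)   # torch.Size([2, 64, 16, 16])
```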
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.