DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis
- URL: http://arxiv.org/abs/2108.12141v1
- Date: Fri, 27 Aug 2021 07:20:34 GMT
- Title: DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis
- Authors: Shulan Ruan, Yong Zhang, Kun Zhang, Yanbo Fan, Fan Tang, Qi Liu,
Enhong Chen
- Abstract summary: We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
- Score: 55.788772366325105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image synthesis refers to generating an image from a given text
description, the key goal of which lies in photo realism and semantic
consistency. Previous methods usually generate an initial image with sentence
embedding and then refine it with fine-grained word embedding. Despite the
significant progress, the 'aspect' information contained in the text (e.g.,
red eyes), a phrase of several words rather than a single word that depicts 'a
particular part or feature of something', is often ignored, even though it is
highly helpful for synthesizing image details. How to better utilize aspect
information in text-to-image synthesis remains an unresolved
challenge. To address this problem, in this paper, we propose a Dynamic
Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively
from multiple granularities, including sentence-level, word-level, and
aspect-level. Moreover, inspired by human learning behaviors, we develop a
novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an
Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement
(ALR) module are alternately employed. AGR utilizes word-level embedding to
globally enhance the previously generated image, while ALR dynamically employs
aspect-level embedding to refine image details from a local perspective.
Finally, a corresponding matching loss function is designed to ensure the
text-image semantic consistency at different levels. Extensive experiments on
two well-studied and publicly available datasets (i.e., CUB-200 and COCO)
demonstrate the superiority and rationality of our method.
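To make the alternating refinement described above more concrete, here is a minimal PyTorch-style sketch of the AGR/ALR loop: an attended global step that mixes word-level context into the whole feature map, followed by an aspect-conditioned local step, repeated once per extracted aspect. The module internals, tensor shapes, and the simple attention and feature-modulation operators are illustrative assumptions and do not reproduce the authors' released implementation.
```python
# Illustrative sketch of the Aspect-aware Dynamic Re-drawer (ADR) idea.
# All shapes and operators below are assumptions for demonstration only.
import torch
import torch.nn as nn


class AttendedGlobalRefinement(nn.Module):
    """AGR (assumed form): attend over word embeddings, refine the whole map."""

    def __init__(self, feat_dim: int, word_dim: int):
        super().__init__()
        self.query = nn.Conv2d(feat_dim, word_dim, kernel_size=1)
        self.fuse = nn.Conv2d(feat_dim + word_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, feat, words):
        # feat: (B, C, H, W) image features; words: (B, T, D) word embeddings
        b, c, h, w = feat.shape
        q = self.query(feat).flatten(2).transpose(1, 2)          # (B, HW, D)
        attn = torch.softmax(q @ words.transpose(1, 2), dim=-1)  # (B, HW, T)
        ctx = (attn @ words).transpose(1, 2).reshape(b, -1, h, w)  # (B, D, H, W)
        return feat + self.fuse(torch.cat([feat, ctx], dim=1))


class AspectLocalRefinement(nn.Module):
    """ALR (assumed form): condition features on one aspect embedding."""

    def __init__(self, feat_dim: int, aspect_dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(aspect_dim, 2 * feat_dim)
        self.refine = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, feat, aspect):
        # feat: (B, C, H, W); aspect: (B, D) one aspect-level embedding (e.g. "red eyes")
        scale, shift = self.to_scale_shift(aspect).chunk(2, dim=1)
        modulated = feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return feat + self.refine(modulated)


class AspectAwareDynamicRedrawer(nn.Module):
    """ADR: alternate global (word-level) and local (aspect-level) refinement."""

    def __init__(self, feat_dim=64, word_dim=256, aspect_dim=256):
        super().__init__()
        self.agr = AttendedGlobalRefinement(feat_dim, word_dim)
        self.alr = AspectLocalRefinement(feat_dim, aspect_dim)

    def forward(self, feat, words, aspects):
        # aspects: (B, K, D); one AGR + ALR round per aspect in the description
        for k in range(aspects.size(1)):
            feat = self.agr(feat, words)
            feat = self.alr(feat, aspects[:, k])
        return feat


if __name__ == "__main__":
    adr = AspectAwareDynamicRedrawer()
    feat = torch.randn(2, 64, 32, 32)      # initial image features
    words = torch.randn(2, 18, 256)        # word-level embeddings
    aspects = torch.randn(2, 3, 256)       # aspect-level embeddings
    print(adr(feat, words, aspects).shape)  # torch.Size([2, 64, 32, 32])
```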
Related papers
- Fine-grained Cross-modal Fusion based Refinement for Text-to-Image
Synthesis [12.954663420736782]
We propose a novel Fine-grained text-image Fusion based Generative Adversarial Network, dubbed FF-GAN.
FF-GAN consists of two modules: a Fine-grained text-image Fusion Block (FF-Block) and a Global Semantic Refinement (GSR) module.
arXiv Detail & Related papers (2023-02-17T05:44:05Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct feature graphs for the image and text modalities, respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image
Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for
Controllable Text-Driven Person Image Generation [73.3790833537313]
Controllable person image generation promotes a wide range of applications such as digital human interaction and virtual try-on.
We propose HumanDiffusion, a coarse-to-fine alignment diffusion framework, for text-driven person image generation.
arXiv Detail & Related papers (2022-11-11T14:30:34Z) - IR-GAN: Image Manipulation with Linguistic Instruction by Increment
Reasoning [110.7118381246156]
The Increment Reasoning Generative Adversarial Network (IR-GAN) aims to reason about the consistency between the visual increment in images and the semantic increment in instructions.
First, we introduce word-level and instruction-level instruction encoders to learn the user's intention from history-correlated instructions as the semantic increment.
Second, we embed the representation of the semantic increment into that of the source image to generate the target image, where the source image serves as a referring auxiliary.
arXiv Detail & Related papers (2022-04-02T07:48:39Z) - Self-Supervised Image-to-Text and Text-to-Image Synthesis [23.587581181330123]
We propose a novel self-supervised deep learning based approach towards learning the cross-modal embedding spaces.
In our approach, we first obtain dense vector representations of images using a StackGAN-based autoencoder model, and sentence-level dense vector representations of text using an LSTM-based text autoencoder.
arXiv Detail & Related papers (2021-12-09T13:54:56Z) - Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
Its StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Its visual-linguistic similarity module learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation during manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more efficient at synthesizing realistic and text-matching images; a sketch of the gradient-penalty idea follows this entry.
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
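For the DF-GAN entry above, the following sketch illustrates what a Matching-Aware Gradient Penalty can look like: the discriminator's gradient is penalized at (real image, matching sentence) pairs, which smooths the loss surface around text-matching real data. The discriminator interface, the toy demo, and the constants k and p are assumptions for illustration, not a verified reimplementation of DF-GAN.
```python
# Illustrative sketch of a matching-aware gradient penalty for a one-way
# conditional discriminator. Constants and interfaces are assumptions.
import torch
import torch.nn as nn


def matching_aware_gradient_penalty(disc, real_imgs, sent_emb, k=2.0, p=6.0):
    """disc(img, sent) -> (B, 1) single adversarial output per pair."""
    real_imgs = real_imgs.requires_grad_(True)
    sent_emb = sent_emb.requires_grad_(True)
    out = disc(real_imgs, sent_emb)
    # Gradients of the discriminator output w.r.t. the real, text-matching inputs
    grad_img, grad_sent = torch.autograd.grad(
        outputs=out,
        inputs=(real_imgs, sent_emb),
        grad_outputs=torch.ones_like(out),
        create_graph=True,  # keep the graph so the penalty can be backpropagated
    )
    grad_norm = torch.sqrt(
        grad_img.flatten(1).pow(2).sum(1) + grad_sent.flatten(1).pow(2).sum(1)
    )
    return k * grad_norm.pow(p).mean()


if __name__ == "__main__":
    class ToyDisc(nn.Module):
        def __init__(self):
            super().__init__()
            self.img_head = nn.Linear(3 * 8 * 8, 16)
            self.out = nn.Linear(16 + 4, 1)

        def forward(self, img, sent):
            h = self.img_head(img.flatten(1))
            return self.out(torch.cat([h, sent], dim=1))

    disc = ToyDisc()
    imgs = torch.randn(2, 3, 8, 8)   # toy "real" images
    sents = torch.randn(2, 4)        # toy matching sentence embeddings
    print(matching_aware_gradient_penalty(disc, imgs, sents))
```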