DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image
Generation
- URL: http://arxiv.org/abs/2011.02709v3
- Date: Sat, 21 Nov 2020 23:59:25 GMT
- Title: DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image
Generation
- Authors: Zhenxing Zhang and Lambert Schomaker
- Abstract summary: The Dual Attention Generative Adversarial Network (DTGAN) can synthesize high-quality and semantically consistent images.
The proposed model introduces channel-aware and pixel-aware attention modules that can guide the generator to focus on text-relevant channels and pixels.
A new type of visual loss is utilized to enhance the image resolution by ensuring vivid shapes and perceptually uniform color distributions in generated images.
- Score: 8.26410341981427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing text-to-image generation methods adopt a multi-stage modular
architecture which has three significant problems: 1) Training multiple
networks increases the run time and affects the convergence and stability of
the generative model; 2) These approaches ignore the quality of early-stage
generator images; 3) Many discriminators need to be trained. To this end, we
propose the Dual Attention Generative Adversarial Network (DTGAN) which can
synthesize high-quality and semantically consistent images employing only a
single generator/discriminator pair. The proposed model introduces
channel-aware and pixel-aware attention modules that can guide the generator to
focus on text-relevant channels and pixels based on the global sentence vector
and to fine-tune original feature maps using attention weights. Also,
Conditional Adaptive Instance-Layer Normalization (CAdaILN) is presented to
help our attention modules flexibly control the amount of change in shape and
texture according to the input natural-language description. Furthermore, a new
type of visual loss is utilized to enhance the image resolution by ensuring
vivid shapes and perceptually uniform color distributions in the generated
images. Experimental
results on benchmark datasets demonstrate the superiority of our proposed
method over state-of-the-art models that use a multi-stage framework.
Visualization of the attention maps shows that the channel-aware attention
module is able to localize the discriminative regions, while the pixel-aware
attention module has the ability to capture the global visual content for
the generation of an image.
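The channel-aware and pixel-aware attention modules are conditioned on the global sentence vector. The abstract does not give their exact formulation, so the following PyTorch sketch is illustrative only: the sigmoid gating, the 1x1 convolution, and all layer sizes are assumptions rather than the paper's design.

import torch
import torch.nn as nn

class ChannelAwareAttention(nn.Module):
    # Re-weights feature channels using the global sentence vector,
    # so text-relevant channels are emphasized.
    def __init__(self, channels: int, sent_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(sent_dim, channels),
            nn.Sigmoid(),  # one weight in (0, 1) per channel
        )

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), sent: (B, sent_dim)
        w = self.gate(sent).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return feat * w

class PixelAwareAttention(nn.Module):
    # Re-weights spatial positions by matching local features
    # against the broadcast sentence vector.
    def __init__(self, channels: int, sent_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(channels + sent_dim, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        b, _, h, w = feat.shape
        s = sent.unsqueeze(-1).unsqueeze(-1).expand(b, -1, h, w)
        a = torch.sigmoid(self.proj(torch.cat([feat, s], dim=1)))  # (B, 1, H, W)
        return feat * a

feat = torch.randn(4, 64, 32, 32)  # generator feature maps
sent = torch.randn(4, 256)         # global sentence vector
feat = ChannelAwareAttention(64, 256)(feat, sent)
feat = PixelAwareAttention(64, 256)(feat, sent)

Multiplying the attention weights back into the feature maps mirrors the abstract's description of fine-tuning the original feature maps with attention weights.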
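CAdaILN follows the pattern of Adaptive Instance-Layer Normalization (AdaILN, known from U-GAT-IT): instance-normalized and layer-normalized features are blended by a learnable ratio, with the affine parameters supplied by the condition. A minimal sketch, assuming gamma and beta are predicted from the sentence vector:

import torch
import torch.nn as nn

class CAdaILN(nn.Module):
    def __init__(self, channels: int, sent_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable blend between instance and layer normalization.
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))
        # Scale and shift predicted from the sentence embedding (assumption).
        self.to_gamma = nn.Linear(sent_dim, channels)
        self.to_beta = nn.Linear(sent_dim, channels)

    def forward(self, x: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # Instance norm: statistics per sample and channel over (H, W).
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        # Layer norm: statistics per sample over (C, H, W).
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        # Blend, then apply the sentence-conditioned affine transform.
        rho = self.rho.clamp(0.0, 1.0)
        x_hat = rho * x_in + (1.0 - rho) * x_ln
        gamma = self.to_gamma(sent).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(sent).unsqueeze(-1).unsqueeze(-1)
        return x_hat * gamma + beta

A per-channel rho lets the network decide, channel by channel, how much instance-style (texture-sensitive) versus layer-style (shape-sensitive) normalization to apply, which is one way to read "control the amount of change in shape and texture".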
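The visual loss is described only by its goals (vivid shapes, perceptually uniform color distributions), not by a formula. The sketch below is one illustrative reading, not the paper's definition: a luminance term for shape plus a per-channel statistics term for color, computed between generated images and the real images paired with the same captions.

import torch

def to_grayscale(img: torch.Tensor) -> torch.Tensor:
    # img: (B, 3, H, W); standard luma weights.
    r, g, b = img[:, 0:1], img[:, 1:2], img[:, 2:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def visual_loss(fake: torch.Tensor, real: torch.Tensor) -> torch.Tensor:
    # Shape term: match the luminance structure of the real images.
    shape = torch.abs(to_grayscale(fake) - to_grayscale(real)).mean()
    # Color term: match per-channel mean and standard deviation,
    # i.e., the color distribution rather than exact pixel values.
    color = (torch.abs(fake.mean(dim=(2, 3)) - real.mean(dim=(2, 3))).mean()
             + torch.abs(fake.std(dim=(2, 3)) - real.std(dim=(2, 3))).mean())
    return shape + color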
Related papers
- HiTSR: A Hierarchical Transformer for Reference-based Super-Resolution [6.546896650921257]
We propose HiTSR, a hierarchical transformer model for reference-based image super-resolution.
We streamline the architecture and training pipeline by incorporating the double attention block from GAN literature.
Our model demonstrates superior performance across three datasets: SUN80, Urban100, and Manga109.
arXiv Detail & Related papers (2024-08-30T01:16:29Z)
- Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment [20.902935570581207]
We introduce a Multimodal Alignment and Reconstruction Network (MARNet) to enhance the model's resistance to visual noise.
MARNet includes a cross-modal diffusion reconstruction module for smoothly and stably blending information across different domains.
Experiments conducted on two benchmark datasets, Vireo-Food172 and Ingredient-101, demonstrate that MARNet effectively improves the quality of image information extracted by the model.
arXiv Detail & Related papers (2024-07-26T16:30:18Z)
- RefDrop: Controllable Consistency in Image or Video Generation via Reference Feature Guidance [22.326405355520176]
RefDrop allows users to control the influence of reference context in a direct and precise manner.
Our method also enables more interesting applications, such as the consistent generation of multiple subjects.
arXiv Detail & Related papers (2024-05-27T21:23:20Z)
- R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z)
- Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of latent diffusion architecture.
The proposed model is trained separately to map text embeddings to image embeddings of CLIP.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
arXiv Detail & Related papers (2023-10-05T12:29:41Z)
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- Cross-View Panorama Image Synthesis [68.35351563852335]
We propose a novel adversarial feedback GAN framework named PanoGAN.
PanoGAN enables high-quality panorama image generation with more convincing details than state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-22T15:59:44Z)
- DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation [7.781425222538382]
DiverGAN is a framework to generate diverse, plausible and semantically consistent images according to a natural-language description.
DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM).
Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture.
arXiv Detail & Related papers (2021-11-17T17:59:56Z)
- Towards Unsupervised Deep Image Enhancement with Generative Adversarial Network [92.01145655155374]
We present an unsupervised image enhancement generative network (UEGAN).
It learns the corresponding image-to-image mapping from a set of images with desired characteristics in an unsupervised manner.
Results show that the proposed model effectively improves the aesthetic quality of images.
arXiv Detail & Related papers (2020-12-30T03:22:46Z)
- Multi-Channel Attention Selection GANs for Guided Image-to-Image Translation [148.9985519929653]
We propose a novel model named Multi-Channel Attention Selection Generative Adversarial Network (SelectionGAN) for guided image-to-image translation.
The proposed framework and modules are unified solutions and can be applied to solve other generation tasks such as semantic image synthesis.
arXiv Detail & Related papers (2020-02-03T23:17:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.