DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse
Text-to-Image Generation
- URL: http://arxiv.org/abs/2111.09267v1
- Date: Wed, 17 Nov 2021 17:59:56 GMT
- Title: DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse
Text-to-Image Generation
- Authors: Zhenxing Zhang and Lambert Schomaker
- Abstract summary: DiverGAN is a framework to generate diverse, plausible and semantically consistent images according to a natural-language description.
DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM).
Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture.
- Score: 7.781425222538382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present an efficient and effective single-stage framework
(DiverGAN) to generate diverse, plausible and semantically consistent images
according to a natural-language description. DiverGAN adopts two novel
word-level attention modules, i.e., a channel-attention module (CAM) and a
pixel-attention module (PAM), which model the importance of each word in the
given sentence while allowing the network to assign larger weights to the
significant channels and pixels semantically aligning with the salient words.
After that, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is
introduced to enable the linguistic cues from the sentence embedding to
flexibly manipulate the amount of change in shape and texture, further
improving visual-semantic representation and helping stabilize the training.
Also, a dual-residual structure is developed to preserve more original visual
features while allowing for deeper networks, resulting in faster convergence
speed and more vivid details. Furthermore, we propose to plug a fully-connected
layer into the pipeline to address the lack-of-diversity problem, since we
observe that a dense layer will remarkably enhance the generative capability of
the network, balancing the trade-off between a low-dimensional random latent
code contributing to variation and modulation modules that use high-dimensional
textual contexts to strengthen feature maps. Inserting a linear layer after
the second residual block achieves the best variety and quality. Both
qualitative and quantitative results on benchmark data sets demonstrate the
superiority of our DiverGAN for realizing diversity, without harming quality
and semantic consistency.
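To make the architectural description above more concrete, below is a minimal, hypothetical PyTorch sketch of two of the building blocks named in the abstract: a word-level channel-attention module and CAdaILN. The tensor shapes, projection layers, and the exact attention formulation are assumptions for illustration only, not the authors' implementation; consult the published DiverGAN code for the actual modules.
```python
# Hypothetical sketch of a word-level channel-attention module and CAdaILN.
# Shapes and formulas are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Reweight feature-map channels by their alignment with word embeddings."""

    def __init__(self, channels: int, word_dim: int):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, channels)  # map words into channel space

    def forward(self, feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, H, W) visual features
        # words: (B, L, word_dim) word embeddings of the input sentence
        pooled = feat.mean(dim=(2, 3))                      # (B, C) global channel summary
        keys = self.word_proj(words)                        # (B, L, C)
        scores = torch.bmm(keys, pooled.unsqueeze(2))       # (B, L, 1) word-channel relevance
        attn = F.softmax(scores, dim=1)                     # importance of each word
        channel_gate = torch.sigmoid((attn * keys).sum(1))  # (B, C) aggregated channel weights
        return feat * channel_gate.unsqueeze(-1).unsqueeze(-1)


class CAdaILN(nn.Module):
    """Blend instance and layer normalization; scale/shift come from the sentence embedding."""

    def __init__(self, channels: int, sent_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.9))  # IN/LN mixing ratio
        self.to_gamma = nn.Linear(sent_dim, channels)
        self.to_beta = nn.Linear(sent_dim, channels)

    def forward(self, x: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) features, sent: (B, sent_dim) sentence embedding
        in_mean = x.mean(dim=(2, 3), keepdim=True)
        in_var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)        # instance normalization
        ln_mean = x.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)        # layer normalization
        rho = self.rho.clamp(0.0, 1.0)
        x_hat = rho * x_in + (1.0 - rho) * x_ln                     # learned blend of IN and LN
        gamma = self.to_gamma(sent).unsqueeze(-1).unsqueeze(-1)     # linguistic scale
        beta = self.to_beta(sent).unsqueeze(-1).unsqueeze(-1)       # linguistic shift
        return x_hat * gamma + beta
```
In the same spirit, the diversity fix described in the abstract would amount to placing a plain fully-connected (linear) layer on the latent path after the second residual block; the exact placement and dimensionality in any such sketch are likewise assumptions.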
Related papers
- Multimodal generative semantic communication based on latent diffusion model [13.035207938169844]
This paper introduces a multimodal generative semantic communication framework named mm-GESCO.
The framework ingests streams of visible and infrared modal image data, generates fused semantic segmentation maps, and transmits them.
At the receiving end, the framework can reconstruct the original multimodal images based on the semantic maps.
arXiv Detail & Related papers (2024-08-10T06:23:41Z) - LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build discriminative global representations by fusing image data and text descriptions of the visual scene.
The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible way to generate text descriptions of images.
Although promising, efficient multi-modal fusion remains a challenge when leveraging LVLMs to build multi-modal VPR solutions.
arXiv Detail & Related papers (2024-07-09T10:15:31Z) - Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z) - Artificial-Spiking Hierarchical Networks for Vision-Language
Representation Learning [16.902924543372713]
State-of-the-art methods achieve impressive performance by pre-training on large-scale datasets.
We propose an efficient framework for multimodal alignment by introducing a novel visual semantic module.
Experiments show that the proposed ASH-Nets achieve competitive results.
arXiv Detail & Related papers (2023-08-18T10:40:25Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
arXiv Detail & Related papers (2022-03-30T01:06:13Z) - Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z) - DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image
Generation [8.26410341981427]
The Dual Attention Generative Adversarial Network (DTGAN) can synthesize high-quality and semantically consistent images.
The proposed model introduces channel-aware and pixel-aware attention modules that can guide the generator to focus on text-relevant channels and pixels.
A new type of visual loss is utilized to enhance the image resolution by ensuring vivid shape and perceptually uniform color distributions of generated images.
arXiv Detail & Related papers (2020-11-05T08:57:15Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent
Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
When paired with a strong auto-regressive decoder, VAEs tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)