CAGAN: Text-To-Image Generation with Combined Attention GANs
- URL: http://arxiv.org/abs/2104.12663v1
- Date: Mon, 26 Apr 2021 15:46:40 GMT
- Title: CAGAN: Text-To-Image Generation with Combined Attention GANs
- Authors: Henning Schulze and Dogucan Yaman and Alexander Waibel
- Abstract summary: We propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions.
The proposed CAGAN uses two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels.
With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the Inception Score (IS) and Fréchet Inception Distance (FID) on the CUB dataset, and on the FID on the more challenging COCO dataset.
- Score: 70.3497683558609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating images according to natural language descriptions is a challenging
task. In this work, we propose the Combined Attention Generative Adversarial
Network (CAGAN) to generate photo-realistic images according to textual
descriptions. The proposed CAGAN utilises two attention models: word attention
to draw different sub-regions conditioned on related words; and
squeeze-and-excitation attention to capture non-linear interaction among
channels. With spectral normalisation to stabilise training, our proposed CAGAN
improves the state of the art on the IS and FID on the CUB dataset and the FID
on the more challenging COCO dataset. Furthermore, we demonstrate that judging
a model by a single evaluation metric can be misleading: an additional variant
with local self-attention scores a higher IS, outperforming the state of the
art on the CUB dataset, yet generates unrealistic images through feature
repetition.
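As a hedged illustration of the two attention mechanisms the abstract names, here is a minimal PyTorch sketch; the module names, tensor shapes, and reduction ratio are assumptions (the word attention follows the AttnGAN-style formulation), not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEAttention(nn.Module):
    """Squeeze-and-excitation attention (Hu et al., 2018): globally pool each
    channel, then re-weight channels through a small bottleneck MLP."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        squeeze = x.mean(dim=(2, 3))                      # (B, C) channel stats
        gate = torch.sigmoid(self.fc2(F.relu(self.fc1(squeeze))))
        return x * gate.unsqueeze(-1).unsqueeze(-1)       # channel re-weighting

class WordAttention(nn.Module):
    """AttnGAN-style word attention: each image sub-region attends over the
    word embeddings of the description."""
    def __init__(self, word_dim: int, feat_dim: int):
        super().__init__()
        self.proj = nn.Conv1d(word_dim, feat_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features; words: (B, word_dim, T) embeddings
        B, C, H, W = feats.shape
        context = self.proj(words)                        # (B, C, T)
        queries = feats.flatten(2).transpose(1, 2)        # (B, H*W, C)
        attn = torch.softmax(queries @ context, dim=-1)   # (B, H*W, T)
        attended = attn @ context.transpose(1, 2)         # (B, H*W, C)
        return attended.transpose(1, 2).reshape(B, C, H, W)

# Spectral normalisation, as mentioned in the abstract, wraps weight layers to
# stabilise adversarial training:
conv = nn.utils.spectral_norm(nn.Conv2d(64, 64, kernel_size=3, padding=1))
```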
Related papers
- Enhancing Conditional Image Generation with Explainable Latent Space Manipulation [0.0]
This paper proposes a novel approach to achieve fidelity to a reference image while adhering to conditional prompts.
We analyze the cross attention maps of the cross attention layers and the gradients of the denoised latent vector.
Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features.
arXiv Detail & Related papers (2024-08-29T03:12:04Z)
- CIC-BART-SSA: Controllable Image Captioning with Structured Semantic Augmentation [9.493755431645313]
We propose a novel, fully automatic method to sample additional focused and visually grounded captions.
We leverage Abstract Meaning Representation (AMR) to encode all possible spatio-semantic relations between entities.
We then develop a new model, CIC-BART-SSA, that sources its control signals from SSA-diversified datasets.
arXiv Detail & Related papers (2024-07-16T05:26:12Z)
- Dual-Image Enhanced CLIP for Zero-Shot Anomaly Detection [58.228940066769596]
We introduce a Dual-Image Enhanced CLIP approach, leveraging a joint vision-language scoring system.
Our methods process pairs of images, utilizing each as a visual reference for the other, thereby enriching the inference process with visual context.
Our approach exploits the potential of joint vision-language anomaly detection and achieves performance comparable with current SOTA methods across various datasets.
arXiv Detail & Related papers (2024-05-08T03:13:20Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Localized Text-to-Image Generation for Free via Cross Attention Control [154.06530917754515]
We show that localized generation can be achieved by simply controlling cross attention maps during inference.
Our proposed cross attention control (CAC) provides new open-vocabulary localization abilities to standard text-to-image models (a minimal sketch of the masking idea appears after this list).
arXiv Detail & Related papers (2023-06-26T12:15:06Z)
- Controllable Image Generation via Collage Representations [31.456445433105415]
"Mixing and matching scenes" (M&Ms) is an approach that consists of an adversarially trained generative image model conditioned on appearance features and spatial positions of the different elements in a collage.
We show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity.
arXiv Detail & Related papers (2023-04-26T17:58:39Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
In this paper, we present a novel prompt-based scheme to train the UIC model, making the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- SegAttnGAN: Text to Image Generation with Segmentation Attention [6.561007033994183]
We propose a novel generative network (SegAttnGAN) that utilizes additional segmentation information for the text-to-image synthesis task.
Because the segmentation data introduced to the model provides useful guidance for generator training, the proposed model generates images with better realism.
arXiv Detail & Related papers (2020-05-25T23:56:41Z)
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)
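As flagged in the "Localized Text-to-Image Generation for Free via Cross Attention Control" entry above, localized generation can be achieved by masking cross attention maps at inference so each text token only influences its designated spatial region. The sketch below is an illustrative assumption of that idea in PyTorch, not the paper's actual code.

```python
import torch

def masked_cross_attention(q, k, v, region_mask):
    """q: (B, P, d) image-patch queries; k, v: (B, T, d) text keys/values.
    region_mask: (B, P, T), 1 where token t may attend at patch p (e.g. from a
    user-drawn box per phrase). Keep at least one token allowed per patch
    (e.g. the start-of-text token) so each softmax row stays well defined."""
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)  # (B, P, T)
    scores = scores.masked_fill(region_mask == 0, float("-inf"))
    attn = torch.softmax(scores, dim=-1)
    return attn @ v  # each patch is conditioned only on its allowed tokens

# Toy usage: forbid token 3 from influencing the top half of the image.
B, P, T, d = 1, 64, 8, 32
q, k, v = torch.randn(B, P, d), torch.randn(B, T, d), torch.randn(B, T, d)
mask = torch.ones(B, P, T)
mask[:, :32, 3] = 0
out = masked_cross_attention(q, k, v, mask)  # (B, P, d)
```

The same masking primitive, applied at selected denoising timesteps, is close in spirit to the mask-based subject preservation described in the "Explainable Latent Space Manipulation" entry above.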