The Nuts and Bolts of Adopting Transformer in GANs
- URL: http://arxiv.org/abs/2110.13107v3
- Date: Tue, 13 Jun 2023 15:07:15 GMT
- Title: The Nuts and Bolts of Adopting Transformer in GANs
- Authors: Rui Xu, Xiangyu Xu, Kai Chen, Bolei Zhou, Chen Change Loy
- Abstract summary: We investigate the properties of the Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GANs: a convolutional neural network (CNN)-free generator termed STrans-G.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have become prevalent in computer vision, especially for
high-level vision tasks. However, adopting the Transformer in the generative
adversarial network (GAN) framework remains an open and challenging problem. In
this paper, we conduct a comprehensive empirical study of the properties of the
Transformer in GANs for high-fidelity image synthesis. Our analysis highlights
and reaffirms the importance of feature locality in image generation, even
though the merits of locality are already well known for classification.
Perhaps more interestingly, we find that the residual connections in
self-attention layers are harmful to learning Transformer-based discriminators
and conditional generators. We carefully examine this influence and propose
effective ways to mitigate the negative impacts. Our study leads to a new
alternative design of Transformers in GANs: a convolutional neural network
(CNN)-free generator, termed STrans-G, which achieves competitive results in
both unconditional and conditional image generation. The Transformer-based
discriminator, STrans-D, also significantly narrows the gap to CNN-based
discriminators.
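The abstract's finding about residual connections can be made concrete with a minimal NumPy sketch (an illustration, not the authors' code): a single-head scaled dot-product self-attention layer whose skip connection can be toggled off, which is the general direction of the mitigation described. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, residual=True):
    """Single-head scaled dot-product self-attention over a token matrix x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v
    # The paper reports that this skip connection can hurt Transformer-based
    # discriminators and conditional generators; residual=False removes it,
    # sketching the kind of mitigation the abstract alludes to.
    return x + out if residual else out

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

y_res = self_attention(x, w_q, w_k, w_v, residual=True)
y_no = self_attention(x, w_q, w_k, w_v, residual=False)
assert y_res.shape == y_no.shape == (n, d)
# The two outputs differ by exactly the identity (skip) path.
assert np.allclose(y_res - y_no, x)
```

The toggle isolates the identity path so its contribution can be inspected directly; the actual STrans-G/STrans-D architectures involve further design choices not reflected here.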
Related papers
- Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey (2023-02-17)
  Generative Adversarial Networks (GANs) have been very successful at synthesizing images in a given dataset.
  Recent works have tried to exploit Transformers in the GAN framework for image/video synthesis.
  This paper presents a comprehensive survey of the developments and advancements in GANs that utilize Transformer networks for computer vision applications.
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks (2022-04-16)
  We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
  We apply LW-Transformer to a set of Transformer-based networks and quantitatively evaluate them on three vision-and-language tasks and six benchmark datasets.
  Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks on vision-and-language tasks.
- StyleSwin: Transformer-based GAN for High-resolution Image Generation (2021-12-20)
  We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
  The proposed generator adopts the Swin transformer in a style-based architecture.
  We show that restoring the knowledge of absolute position, which is lost in window-based transformers, greatly benefits generation quality.
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? (2021-08-11)
  We investigate the transfer-learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
  We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
- Combining Transformer Generators with Convolutional Discriminators (2021-05-21)
  The recently proposed TransGAN is the first GAN using only transformer-based architectures.
  TransGAN requires data augmentation, an auxiliary super-resolution task during training, and a masking prior to guide the self-attention mechanism.
  We evaluate our approach by benchmarking well-known CNN discriminators, ablating the size of the transformer-based generator, and showing that combining both architectural elements into a hybrid model leads to better results.
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction (2021-03-22)
  We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
  This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
- TransGAN: Two Transformers Can Make One Strong GAN (2021-02-14)
  We conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures.
  Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator.
  Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones.
- A Survey on Visual Transformer (2020-12-23)
  The Transformer is a type of deep neural network mainly based on the self-attention mechanism.
  In this paper, we review vision transformer models by categorizing them by task and analyzing their advantages and disadvantages.
- Toward Transformer-Based Object Detection (2020-12-17)
  Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
  ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
  We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution for complex vision tasks such as object detection.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.