TransGAN: Two Transformers Can Make One Strong GAN
- URL: http://arxiv.org/abs/2102.07074v2
- Date: Tue, 16 Feb 2021 05:51:12 GMT
- Title: TransGAN: Two Transformers Can Make One Strong GAN
- Authors: Yifan Jiang, Shiyu Chang, Zhangyang Wang
- Abstract summary: We conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures.
Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator.
Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones.
- Score: 111.07699201175919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent explosive interest in transformers has suggested their potential
to become powerful "universal" models for computer vision tasks, such as
classification, detection, and segmentation. However, how much further can
transformers go - are they ready to take on more notoriously difficult vision
tasks, e.g., generative adversarial networks (GANs)? Driven by that curiosity, we
conduct the first pilot study in building a GAN \textbf{completely free of
convolutions}, using only pure transformer-based architectures. Our vanilla GAN
architecture, dubbed \textbf{TransGAN}, consists of a memory-friendly
transformer-based generator that progressively increases feature resolution
while decreasing embedding dimension, and a patch-level discriminator that is
also transformer-based. We then demonstrate that TransGAN notably benefits from
data augmentations (more than standard GANs), a multi-task co-training strategy
for the generator, and a locally initialized self-attention that emphasizes the
neighborhood smoothness of natural images. Equipped with those findings,
TransGAN can effectively scale up with bigger models and high-resolution image
datasets. Our best architecture achieves highly competitive performance
compared to current state-of-the-art GANs based on convolutional backbones.
Specifically, TransGAN sets a \textbf{new state-of-the-art} IS of 10.10 and FID
of 25.32 on STL-10, and reaches a competitive IS of 8.64 and FID of 11.89 on
CIFAR-10, as well as an FID of 12.23 on CelebA $64\times64$. We also conclude
with a discussion of the current
limitations and future potential of TransGAN. The code is available at
\url{https://github.com/VITA-Group/TransGAN}.
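To make the generator design described above concrete, here is a minimal PyTorch sketch of two ideas named in the abstract: a stage that runs transformer blocks under a local self-attention mask and then pixel-shuffles the token grid, doubling the spatial resolution while dividing the embedding dimension by four. This is an illustrative sketch under stated assumptions, not the authors' implementation; names such as GeneratorStage and local_attention_mask are made up for the example, and the actual stage stacking and the schedule for relaxing the local window are in the repository linked above.

```python
# Minimal sketch (not the authors' code) of a memory-friendly transformer
# generator stage: transformer blocks with a local attention mask, followed by
# pixel-shuffle upsampling (H, W double; embedding dim shrinks by 4x).
import torch
import torch.nn as nn


def local_attention_mask(grid_size: int, window: int) -> torch.Tensor:
    """Boolean mask (True = keep) allowing attention only within a
    (2*window+1)^2 neighborhood on a grid_size x grid_size token grid."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid_size), torch.arange(grid_size), indexing="ij"), dim=-1)
    coords = coords.reshape(-1, 2)                                    # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)   # Chebyshev distance
    return dist <= window                                             # (N, N)


class GeneratorStage(nn.Module):
    """Transformer blocks at one resolution, then pixel-shuffle upsampling."""

    def __init__(self, dim: int, grid_size: int, depth: int = 2,
                 heads: int = 4, window: int = 2):
        super().__init__()
        self.grid_size = grid_size
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Local mask; gradually enlarging `window` during training would mimic
        # the "locally initialized" attention that is later relaxed to global.
        self.register_buffer("attn_mask", ~local_attention_mask(grid_size, window))
        self.upsample = nn.PixelShuffle(2)      # C -> C/4, H,W -> 2H,2W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) with N = grid_size ** 2
        x = self.blocks(x, mask=self.attn_mask)
        B, N, C = x.shape
        h = w = self.grid_size
        x = x.transpose(1, 2).reshape(B, C, h, w)   # tokens -> 2D feature map
        x = self.upsample(x)                        # (B, C // 4, 2h, 2w)
        return x.flatten(2).transpose(1, 2)         # back to tokens: (B, 4N, C // 4)


if __name__ == "__main__":
    stage = GeneratorStage(dim=256, grid_size=8)
    tokens = torch.randn(2, 64, 256)                # an 8x8 grid of 256-d tokens
    print(stage(tokens).shape)                      # torch.Size([2, 256, 64])
```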
Related papers
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z) - StyleSwin: Transformer-based GAN for High-resolution Image Generation [28.703687511694305]
We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
The proposed generator adopts the Swin transformer in a style-based architecture.
We show that restoring the absolute-position information lost in window-based transformers greatly benefits generation quality.
arXiv Detail & Related papers (2021-12-20T18:59:51Z) - The Nuts and Bolts of Adopting Transformer in GANs [124.30856952272913]
We investigate the properties of Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GAN, a convolutional neural network (CNN)-free generator termed as STrans-G.
arXiv Detail & Related papers (2021-10-25T17:01:29Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Combining Transformer Generators with Convolutional Discriminators [9.83490307808789]
The recently proposed TransGAN is the first GAN that uses only transformer-based architectures.
TransGAN requires data augmentation, an auxiliary super-resolution task during training, and a masking prior to guide the self-attention mechanism.
We evaluate our approach by conducting a benchmark of well-known CNN discriminators, ablate the size of the transformer-based generator, and show that combining both architectural elements into a hybrid model leads to better results.
arXiv Detail & Related papers (2021-05-21T07:56:59Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.