StyleSwin: Transformer-based GAN for High-resolution Image Generation
- URL: http://arxiv.org/abs/2112.10762v1
- Date: Mon, 20 Dec 2021 18:59:51 GMT
- Title: StyleSwin: Transformer-based GAN for High-resolution Image Generation
- Authors: Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen,
Yong Wang, Baining Guo
- Abstract summary: We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
The proposed generator adopts the Swin transformer in a style-based architecture.
We show that restoring the knowledge of absolute position, which is lost in window-based transformers, greatly benefits generation quality.
- Score: 28.703687511694305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their tantalizing success in a broad range of vision tasks, transformers
have not yet demonstrated ability on par with ConvNets in high-resolution image
generative modeling. In this paper, we seek to explore using pure transformers
to build a generative adversarial network for high-resolution image synthesis.
To this end, we believe that local attention is crucial to strike the balance
between computational efficiency and modeling capacity. Hence, the proposed
generator adopts the Swin transformer in a style-based architecture. To achieve a
larger receptive field, we propose double attention which simultaneously
leverages the context of the local and the shifted windows, leading to improved
generation quality. Moreover, we show that restoring the knowledge of absolute
position, which is lost in window-based transformers, greatly
benefits the generation quality. The proposed StyleSwin is scalable to high
resolutions, with both the coarse geometry and the fine structures benefiting from the
strong expressivity of transformers. However, blocking artifacts occur during
high-resolution synthesis because performing the local attention in a
block-wise manner may break the spatial coherency. To solve this, we
empirically investigate various solutions, among which we find that employing a
wavelet discriminator to examine the spectral discrepancy effectively
suppresses the artifacts. Extensive experiments show the superiority over prior
transformer-based GANs, especially on high resolutions, e.g., 1024x1024. The
StyleSwin, without complex training strategies, excels over StyleGAN on
CelebA-HQ 1024, and achieves on-par performance on FFHQ-1024, proving the
promise of using transformers for high-resolution image generation. The code
and models will be available at https://github.com/microsoft/StyleSwin.
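The core architectural idea, double attention, can be illustrated with a short PyTorch sketch: half of the attention heads attend within regular local windows while the other half attend within shifted windows, and the two outputs are concatenated so each block sees a larger receptive field in a single attention pass. The module below is a minimal, hypothetical illustration of that idea; the class name, hyper-parameters, and the omission of style modulation, positional encoding, and the shifted-window attention mask are simplifying assumptions of this sketch, not the authors' implementation (see the repository linked above for the actual code).

```python
import torch
import torch.nn as nn


def window_partition(x, window_size):
    # Split a (B, H, W, C) feature map into (num_windows * B, window_size**2, C).
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


def window_reverse(windows, window_size, H, W):
    # Inverse of window_partition, back to (B, H, W, C).
    B = windows.shape[0] // (H * W // window_size // window_size)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class DoubleWindowAttention(nn.Module):
    """Hypothetical sketch: heads split between regular and shifted local windows."""

    def __init__(self, dim, window_size=8, num_heads=8):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.window_size = window_size
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _attend(self, x_windows, heads):
        # Scaled dot-product attention inside each window, restricted to the
        # given subset of heads; returns (num_windows * B, N, dim // 2).
        Bw, N, _ = x_windows.shape
        qkv = self.qkv(x_windows).view(Bw, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)
        q, k, v = (t[:, :, heads].transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        out = attn.softmax(dim=-1) @ v
        return out.transpose(1, 2).reshape(Bw, N, -1)

    def forward(self, x):
        # x: (B, H, W, C) feature map; H and W must be divisible by window_size.
        B, H, W, C = x.shape
        ws, shift, half = self.window_size, self.window_size // 2, self.num_heads // 2

        # Branch 1: the first half of the heads attends within regular windows.
        reg = window_reverse(self._attend(window_partition(x, ws), slice(0, half)), ws, H, W)

        # Branch 2: the second half attends within shifted windows (the wrap-around
        # attention mask used by Swin is omitted here for brevity).
        rolled = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
        sh = window_reverse(self._attend(window_partition(rolled, ws),
                                         slice(half, self.num_heads)), ws, H, W)
        sh = torch.roll(sh, shifts=(shift, shift), dims=(1, 2))

        # Concatenating the two half-width outputs gives every position both
        # local and cross-window context in one attention pass.
        return self.proj(torch.cat([reg, sh], dim=-1))


# Example: an 8-head block on a 32x32 feature map with 64 channels.
attn = DoubleWindowAttention(dim=64, window_size=8, num_heads=8)
out = attn(torch.randn(2, 32, 32, 64))   # -> (2, 32, 32, 64)
```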
Related papers
- SwinStyleformer is a favorable choice for image inversion [2.8115030277940947]
This paper proposes the first pure Transformer inversion network, called SwinStyleformer.
Experiments found that an inversion network with a Transformer backbone could not successfully invert images.
arXiv Detail & Related papers (2024-06-19T02:08:45Z) - Towards Lightweight Transformer via Group-wise Transformation for
Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z) - The Nuts and Bolts of Adopting Transformer in GANs [124.30856952272913]
We investigate the properties of Transformer in the generative adversarial network (GAN) framework for high-fidelity image synthesis.
Our study leads to a new alternative design of Transformers in GAN, a convolutional neural network (CNN)-free generator termed STrans-G.
arXiv Detail & Related papers (2021-10-25T17:01:29Z) - HRFormer: High-Resolution Transformer for Dense Prediction [99.6060997466614]
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks.
We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet).
We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks.
arXiv Detail & Related papers (2021-10-18T15:37:58Z) - Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer)
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z) - Combining Transformer Generators with Convolutional Discriminators [9.83490307808789]
The recently proposed TransGAN is the first GAN using only transformer-based architectures.
TransGAN requires data augmentation, an auxiliary super-resolution task during training, and a masking prior to guide the self-attention mechanism.
We evaluate our approach by conducting a benchmark of well-known CNN discriminators, ablate the size of the transformer-based generator, and show that combining both architectural elements into a hybrid model leads to better results.
arXiv Detail & Related papers (2021-05-21T07:56:59Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - TransGAN: Two Transformers Can Make One Strong GAN [111.07699201175919]
We conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures.
Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator.
Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs based on convolutional backbones.
arXiv Detail & Related papers (2021-02-14T05:24:48Z) - Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)