StyleNAT: Giving Each Head a New Perspective
- URL: http://arxiv.org/abs/2211.05770v2
- Date: Sun, 13 Aug 2023 00:03:25 GMT
- Title: StyleNAT: Giving Each Head a New Perspective
- Authors: Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
- Abstract summary: We present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility.
At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information.
StyleNAT attains a new SOTA FID score on FFHQ-256 with 2.046, beating prior arts with convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image generation has been a long sought-after but challenging task, and
performing the generation task in an efficient manner is similarly difficult.
Often researchers attempt to create a "one size fits all" generator, where
there are few differences in the parameter space for drastically different
datasets. Herein, we present a new transformer-based framework, dubbed
StyleNAT, targeting high-quality image generation with superior efficiency and
flexibility. At the core of our model is a carefully designed framework that
partitions attention heads to capture local and global information, which is
achieved by using Neighborhood Attention (NA). With different heads attending
to varying receptive fields, the model can better combine this information and
adapt, in a highly flexible manner, to the data at hand. StyleNAT attains a new
SOTA FID score on FFHQ-256 with 2.046, beating
prior arts with convolutional models such as StyleGAN-XL and transformers such
as HIT and StyleSwin, and a new transformer SOTA on FFHQ-1024 with an FID score
of 4.174. These results show a 6.4% improvement on FFHQ-256 scores when
compared to StyleGAN-XL with a 28% reduction in the number of parameters and
56% improvement in sampling throughput. Code and models will be open-sourced at
https://github.com/SHI-Labs/StyleNAT.
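The head-partitioning idea described in the abstract can be sketched in plain NumPy. This is a hypothetical toy, not the paper's actual implementation (StyleNAT builds on an optimized Neighborhood Attention kernel with learned Q/K/V projections, all omitted here for brevity): each head runs attention restricted to its own k x k neighborhood, so heads with small kernels capture local structure while a kernel spanning the whole feature map approximates global attention.

```python
import numpy as np

def neighborhood_attention_head(x, k):
    """Single-head neighborhood attention over an (H, W, C) feature map.

    Each query position attends only to the k x k window centred on it
    (clamped at the borders) -- a toy stand-in for Neighborhood Attention.
    Learned Q/K/V projections are skipped; x is used directly.
    """
    H, W, C = x.shape
    out = np.zeros_like(x)
    r = k // 2
    for i in range(H):
        for j in range(W):
            # Clamp the window so border queries still see a full k x k patch.
            i0 = min(max(i - r, 0), max(H - k, 0))
            j0 = min(max(j - r, 0), max(W - k, 0))
            keys = x[i0:i0 + k, j0:j0 + k].reshape(-1, C)
            logits = keys @ x[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())   # stable softmax
            w /= w.sum()
            out[i, j] = w @ keys                # weighted sum of neighbours
    return out

def partitioned_head_attention(x, kernel_sizes):
    """Split channels into heads and give each head its own receptive field."""
    H, W, C = x.shape
    n = len(kernel_sizes)
    assert C % n == 0, "channels must divide evenly across heads"
    d = C // n
    heads = [neighborhood_attention_head(x[..., h * d:(h + 1) * d], k)
             for h, k in enumerate(kernel_sizes)]
    return np.concatenate(heads, axis=-1)

x = np.random.default_rng(0).normal(size=(8, 8, 8))
# Kernel 8 on an 8x8 map makes the last head effectively global.
y = partitioned_head_attention(x, kernel_sizes=[3, 5, 7, 8])
print(y.shape)  # (8, 8, 8)
```

Setting the last head's kernel equal to the feature-map side makes that head attend globally, which mirrors the local-to-global progression of receptive fields the abstract describes.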
Related papers
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- PointGPT: Auto-regressively Generative Pre-training from Point Clouds [45.488532108226565]
We present PointGPT, a novel approach that extends the concept of GPT to point clouds.
Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models.
Our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models.
arXiv Detail & Related papers (2023-05-19T07:39:04Z)
- AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy (but perceptually insignificant) compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- StyleSwin: Transformer-based GAN for High-resolution Image Generation [28.703687511694305]
We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
The proposed generator adopts the Swin transformer in a style-based architecture.
We show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality.
arXiv Detail & Related papers (2021-12-20T18:59:51Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
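This trade-off can be made concrete with a rough FLOP count (an illustrative back-of-the-envelope sketch, counting only the two attention matrix multiplies and ignoring projections and softmax): global self-attention over N = H*W tokens costs on the order of N^2 * d, while attention restricted to a w x w window costs on the order of N * w^2 * d.

```python
# Back-of-the-envelope attention cost: counts only the QK^T and A @ V
# matmuls, ignoring Q/K/V projections and the softmax. Numbers are
# illustrative, not measurements from any particular model.

def global_attn_flops(h, w, d):
    n = h * w               # every token attends to every other token
    return 2 * n * n * d

def windowed_attn_flops(h, w, d, win):
    n = h * w               # each token attends to win * win keys
    return 2 * n * win * win * d

g = global_attn_flops(64, 64, 64)       # 2,147,483,648
l = windowed_attn_flops(64, 64, 64, 7)  # 25,690,112
print(g, l, round(g / l, 1))            # window attention ~84x cheaper here
```

The quadratic term in the global case is exactly why window-based designs such as CSWin and Neighborhood Attention exist; the price is the limited field of interaction that this paragraph points out.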
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Styleformer: Transformer based Generative Adversarial Networks with Style Vector [5.025654873456756]
Styleformer is a style-based generator for GAN architectures that is convolution-free and transformer-based.
We show how a transformer can generate high-quality images, overcoming the difficulty convolution operations have in capturing global features of an image.
arXiv Detail & Related papers (2021-06-13T15:30:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.