StyleNAT: Giving Each Head a New Perspective
- URL: http://arxiv.org/abs/2211.05770v2
- Date: Sun, 13 Aug 2023 00:03:25 GMT
- Title: StyleNAT: Giving Each Head a New Perspective
- Authors: Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
- Abstract summary: We present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility.
At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information.
StyleNAT attains a new SOTA FID score of 2.046 on FFHQ-256, beating prior art such as the convolutional StyleGAN-XL and the transformer-based HIT and StyleSwin.
- Score: 71.84791905122052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image generation has been a long sought-after but challenging task, and
performing the generation task in an efficient manner is similarly difficult.
Researchers often attempt to create a "one size fits all" generator, where
the parameter space changes little across drastically different datasets.
Herein, we present a new transformer-based framework, dubbed
StyleNAT, targeting high-quality image generation with superior efficiency and
flexibility. At the core of our model is a carefully designed framework that
partitions attention heads to capture local and global information, which is
achieved by using Neighborhood Attention (NA). With different heads able to
attend to varying receptive fields, the model is better able to combine this
information and adapt, in a highly flexible manner, to the data at hand.
StyleNAT attains a new SOTA FID score of 2.046 on FFHQ-256, beating prior art
such as the convolutional StyleGAN-XL and the transformer-based HIT and
StyleSwin, and a new transformer SOTA on FFHQ-1024 with an FID score of 4.174.
These results show a 6.4% improvement on FFHQ-256 compared to StyleGAN-XL,
with a 28% reduction in the number of parameters and a 56% improvement in
sampling throughput. Code and models will be open-sourced at
https://github.com/SHI-Labs/StyleNAT.
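To make the head-partitioning idea above concrete, the following is a minimal PyTorch sketch (not StyleNAT's actual implementation) in which the attention heads are split into groups and each group attends over a neighborhood with its own kernel size and dilation, so some heads see fine local structure while dilated heads cover a wider context. The class name, the default kernels/dilations, and the unfold-based neighborhood gathering (zero-padded borders rather than NA's shifted windows) are illustrative assumptions; the official code linked above relies on optimized Neighborhood Attention kernels instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SplitHeadNeighborhoodAttention(nn.Module):
    """Toy sketch: heads are split into groups, and each group attends over a
    different (kernel_size, dilation) neighborhood, giving each group of heads
    a different receptive field over the feature map."""

    def __init__(self, dim, num_heads=8, kernels=(7, 7), dilations=(1, 4)):
        super().__init__()
        assert num_heads % len(kernels) == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.groups = list(zip(kernels, dilations))
        self.heads_per_group = num_heads // len(self.groups)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, H, W, C)
        B, H, W, C = x.shape
        qkv = self.qkv(x).reshape(B, H * W, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each: (B, heads, H*W, head_dim)

        outs, h0 = [], 0
        for kernel, dilation in self.groups:
            h1 = h0 + self.heads_per_group
            qg, kg, vg = q[:, h0:h1], k[:, h0:h1], v[:, h0:h1]
            pad = dilation * (kernel - 1) // 2   # keeps spatial size unchanged

            def gather(t):                       # (B, g, HW, d) -> (B, g, HW, k*k, d)
                t = t.transpose(2, 3).reshape(-1, self.head_dim, H, W)
                t = F.unfold(t, kernel, dilation=dilation, padding=pad)
                t = t.reshape(B, self.heads_per_group, self.head_dim,
                              kernel * kernel, H * W)
                return t.permute(0, 1, 4, 3, 2)

            kn, vn = gather(kg), gather(vg)      # neighborhood keys / values
            attn = torch.einsum("bghd,bghkd->bghk", qg, kn) * self.head_dim ** -0.5
            attn = attn.softmax(dim=-1)
            outs.append(torch.einsum("bghk,bghkd->bghd", attn, vn))
            h0 = h1

        out = torch.cat(outs, dim=1)             # (B, heads, H*W, head_dim)
        out = out.transpose(1, 2).reshape(B, H, W, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 16, 64)
    print(SplitHeadNeighborhoodAttention(64)(x).shape)  # torch.Size([2, 16, 16, 64])
```

Splitting the head budget this way keeps the cost linear in the number of pixels (each query attends only to its kernel*kernel neighbors), while the dilated groups widen the receptive field so local detail and broader structure are mixed within a single attention layer.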
Related papers
- EdgeNAT: Transformer for Efficient Edge Detection [2.34098299695111]
We propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder.
Experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images.
arXiv Detail & Related papers (2024-08-20T04:04:22Z)
- LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z)
- Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks.
Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
arXiv Detail & Related papers (2024-04-02T17:58:49Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- PointGPT: Auto-regressively Generative Pre-training from Point Clouds [45.488532108226565]
We present PointGPT, a novel approach that extends the concept of GPT to point clouds.
Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models.
Our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models.
arXiv Detail & Related papers (2023-05-19T07:39:04Z)
- AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scaleable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Styleformer: Transformer based Generative Adversarial Networks with Style Vector [5.025654873456756]
Styleformer is a style-based generator for GAN architectures, but one that is convolution-free and transformer-based.
We show how a transformer can generate high-quality images, overcoming the disadvantage that convolution operations struggle to capture global features in an image.
arXiv Detail & Related papers (2021-06-13T15:30:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.