StyleNAT: Giving Each Head a New Perspective
- URL: http://arxiv.org/abs/2211.05770v2
- Date: Sun, 13 Aug 2023 00:03:25 GMT
- Title: StyleNAT: Giving Each Head a New Perspective
- Authors: Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
- Abstract summary: We present a new transformer-based framework, dubbed StyleNAT, targeting high-quality image generation with superior efficiency and flexibility.
At the core of our model is a carefully designed framework that partitions attention heads to capture local and global information.
StyleNAT attains a new SOTA FID score on FFHQ-256 with 2.046, beating prior arts with convolutional models such as StyleGAN-XL and transformers such as HIT and StyleSwin.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image generation has been a long sought-after but challenging task, and
performing the generation task in an efficient manner is similarly difficult.
Often researchers attempt to create a "one size fits all" generator, where
there are few differences in the parameter space for drastically different
datasets. Herein, we present a new transformer-based framework, dubbed
StyleNAT, targeting high-quality image generation with superior efficiency and
flexibility. At the core of our model is a carefully designed framework that
partitions attention heads to capture local and global information, which is
achieved by using Neighborhood Attention (NA). With different heads attending
to varying receptive fields, the model can better combine this information and
adapt, in a highly flexible manner, to the data at hand. StyleNAT attains a new
SOTA FID score on FFHQ-256 with 2.046, beating
prior arts with convolutional models such as StyleGAN-XL and transformers such
as HIT and StyleSwin, and a new transformer SOTA on FFHQ-1024 with an FID score
of 4.174. These results show a 6.4% improvement on FFHQ-256 scores when
compared to StyleGAN-XL with a 28% reduction in the number of parameters and
56% improvement in sampling throughput. Code and models will be open-sourced at
https://github.com/SHI-Labs/StyleNAT.
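The head-partitioning idea described in the abstract can be sketched in plain NumPy. This is a hypothetical toy, not the paper's actual implementation (StyleNAT builds on an optimized Neighborhood Attention kernel with learned Q/K/V projections, all omitted here for brevity): each head runs attention restricted to its own k x k neighborhood, so heads with small kernels capture local structure while a kernel spanning the whole feature map approximates global attention.

```python
import numpy as np

def neighborhood_attention_head(x, k):
    """Single-head neighborhood attention over an (H, W, C) feature map.

    Each query position attends only to the k x k window centred on it
    (clamped at the borders) -- a toy stand-in for Neighborhood Attention.
    Learned Q/K/V projections are skipped; x is used directly.
    """
    H, W, C = x.shape
    out = np.zeros_like(x)
    r = k // 2
    for i in range(H):
        for j in range(W):
            # Clamp the window so border queries still see a full k x k patch.
            i0 = min(max(i - r, 0), max(H - k, 0))
            j0 = min(max(j - r, 0), max(W - k, 0))
            keys = x[i0:i0 + k, j0:j0 + k].reshape(-1, C)
            logits = keys @ x[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())   # stable softmax
            w /= w.sum()
            out[i, j] = w @ keys                # weighted sum of neighbours
    return out

def partitioned_head_attention(x, kernel_sizes):
    """Split channels into heads and give each head its own receptive field."""
    H, W, C = x.shape
    n = len(kernel_sizes)
    assert C % n == 0, "channels must divide evenly across heads"
    d = C // n
    heads = [neighborhood_attention_head(x[..., h * d:(h + 1) * d], k)
             for h, k in enumerate(kernel_sizes)]
    return np.concatenate(heads, axis=-1)

x = np.random.default_rng(0).normal(size=(8, 8, 8))
# Kernel 8 on an 8x8 map makes the last head effectively global.
y = partitioned_head_attention(x, kernel_sizes=[3, 5, 7, 8])
print(y.shape)  # (8, 8, 8)
```

Setting the last head's kernel equal to the feature-map side makes that head attend globally, which mirrors the local-to-global progression of receptive fields the abstract describes.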
Related papers
- DiffiT: Diffusion Vision Transformers for Image Generation [88.08529836125399]
Vision Transformer (ViT) has demonstrated strong modeling capabilities and scalability, especially for recognition tasks.
We study the effectiveness of ViTs in diffusion-based generative learning and propose a new model denoted as Diffusion Vision Transformers (DiffiT).
DiffiT is surprisingly effective in generating high-fidelity images with significantly better parameter efficiency.
arXiv Detail & Related papers (2023-12-04T18:57:01Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- PointGPT: Auto-regressively Generative Pre-training from Point Clouds [45.488532108226565]
We present PointGPT, a novel approach that extends the concept of GPT to point clouds.
Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models.
Our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models.
arXiv Detail & Related papers (2023-05-19T07:39:04Z)
- AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy (but perceptually insignificant) compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- StyleSwin: Transformer-based GAN for High-resolution Image Generation [28.703687511694305]
We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
The proposed generator adopts the Swin transformer in a style-based architecture.
We show that offering the knowledge of the absolute position that has been lost in window-based transformers greatly benefits the generation quality.
arXiv Detail & Related papers (2021-12-20T18:59:51Z)
- CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
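This trade-off can be made concrete with a rough FLOP count (an illustrative back-of-the-envelope sketch, counting only the two attention matrix multiplies and ignoring projections and softmax): global self-attention over N = H*W tokens costs on the order of N^2 * d, while attention restricted to a w x w window costs on the order of N * w^2 * d.

```python
# Back-of-the-envelope attention cost: counts only the QK^T and A @ V
# matmuls, ignoring Q/K/V projections and the softmax. Numbers are
# illustrative, not measurements from any particular model.

def global_attn_flops(h, w, d):
    n = h * w               # every token attends to every other token
    return 2 * n * n * d

def windowed_attn_flops(h, w, d, win):
    n = h * w               # each token attends to win * win keys
    return 2 * n * win * win * d

g = global_attn_flops(64, 64, 64)       # 2,147,483,648
l = windowed_attn_flops(64, 64, 64, 7)  # 25,690,112
print(g, l, round(g / l, 1))            # window attention ~84x cheaper here
```

The quadratic term in the global case is exactly why window-based designs such as CSWin and Neighborhood Attention exist; the price is the limited field of interaction that this paragraph points out.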
arXiv Detail & Related papers (2021-07-01T17:59:56Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Styleformer: Transformer based Generative Adversarial Networks with Style Vector [5.025654873456756]
Styleformer is a style-based generator for GAN architectures that is convolution-free and transformer-based.
We show how a transformer can generate high-quality images, overcoming the difficulty convolution operations have in capturing global features of an image.
arXiv Detail & Related papers (2021-06-13T15:30:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.