Efficient Image Generation with Variadic Attention Heads
- URL: http://arxiv.org/abs/2211.05770v3
- Date: Thu, 26 Jun 2025 05:07:48 GMT
- Title: Efficient Image Generation with Variadic Attention Heads
- Authors: Steven Walton, Ali Hassani, Xingqian Xu, Zhangyang Wang, Humphrey Shi
- Abstract summary: We propose a simple, yet powerful method to allow the attention heads of a single transformer to attend to multiple receptive fields. We demonstrate our method utilizing Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we are able to achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL.
- Score: 66.9694645123474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the integration of transformers into vision models has yielded significant improvements on vision tasks, they still require significant amounts of computation for both training and inference. Restricted attention mechanisms significantly reduce these computational burdens but come at the cost of losing either global or local coherence. We propose a simple, yet powerful method to reduce these trade-offs: allow the attention heads of a single transformer to attend to multiple receptive fields. We demonstrate our method utilizing Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we are able to achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL, while utilizing 28% fewer parameters and with 4$\times$ the throughput capacity. StyleNAT achieves the Pareto frontier on FFHQ-256 and demonstrates powerful and efficient image generation on other datasets. Our code and model checkpoints are publicly available at: https://github.com/SHI-Labs/StyleNAT
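To make the core idea concrete, below is a minimal PyTorch sketch, not the authors' implementation, of a single attention layer whose head groups attend over different receptive fields. The names (`VariadicHeadAttention`, `naive_neighborhood_attention`) and the choice of kernel sizes are illustrative assumptions; the released StyleNAT code builds on the optimized Neighborhood Attention kernels of the NATTEN library rather than the naive unfold-based attention used here for clarity.

```python
# Sketch of "variadic" attention heads: one layer, several head groups,
# each group attending over a different local neighborhood (receptive field).
# Naive, unoptimized implementation for illustration only.
import torch
import torch.nn.functional as F
from torch import nn


def naive_neighborhood_attention(q, k, v, kernel_size):
    """q, k, v: (B, heads, H, W, head_dim). Each query attends to the
    kernel_size x kernel_size neighborhood centered on its own position."""
    B, h, H, W, d = q.shape
    pad = kernel_size // 2  # assumes odd kernel_size

    def gather_neighborhood(x):
        # (B, h, H, W, d) -> (B, h, H, W, K*K, d) of local neighbors
        x = x.permute(0, 1, 4, 2, 3).reshape(B * h, d, H, W)
        x = F.unfold(x, kernel_size, padding=pad)          # (B*h, d*K*K, H*W)
        x = x.reshape(B, h, d, kernel_size * kernel_size, H, W)
        return x.permute(0, 1, 4, 5, 3, 2)

    kn, vn = gather_neighborhood(k), gather_neighborhood(v)
    attn = torch.einsum("bhxyd,bhxynd->bhxyn", q, kn) * d ** -0.5
    attn = attn.softmax(dim=-1)
    return torch.einsum("bhxyn,bhxynd->bhxyd", attn, vn)


class VariadicHeadAttention(nn.Module):
    """Single attention layer whose head groups use different kernel sizes."""

    def __init__(self, dim, num_heads=8, kernel_sizes=(3, 7)):
        super().__init__()
        assert num_heads % len(kernel_sizes) == 0
        self.num_heads, self.kernel_sizes = num_heads, kernel_sizes
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (B, H, W, C)
        B, H, W, C = x.shape
        qkv = self.qkv(x).reshape(B, H, W, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(3, 0, 4, 1, 2, 5)             # each (B, heads, H, W, d)
        group = self.num_heads // len(self.kernel_sizes)
        outs = []
        for i, ks in enumerate(self.kernel_sizes):
            s = slice(i * group, (i + 1) * group)           # this group's heads
            outs.append(naive_neighborhood_attention(q[:, s], k[:, s], v[:, s], ks))
        out = torch.cat(outs, dim=1)                        # merge head groups
        out = out.permute(0, 2, 3, 1, 4).reshape(B, H, W, C)
        return self.proj(out)


# Usage: an 8-head layer where half the heads see 3x3 and half see 7x7 regions.
layer = VariadicHeadAttention(dim=64, num_heads=8, kernel_sizes=(3, 7))
y = layer(torch.randn(2, 16, 16, 64))                       # (2, 16, 16, 64)
```

Partitioning heads by dilation rather than kernel size, as in dilated Neighborhood Attention, follows the same pattern: each head group simply receives a different neighborhood-gathering configuration.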
Related papers
- S2AFormer: Strip Self-Attention for Efficient Vision Transformer [37.930090368513355]
Vision Transformer (ViT) has made significant advancements in computer vision. Recent methods have combined the strengths of convolutions and self-attention to achieve better trade-offs. We propose S2AFormer, an efficient Vision Transformer architecture featuring a novel Strip Self-Attention (SSA) mechanism.
arXiv Detail & Related papers (2025-05-28T10:17:23Z) - FlexDiT: Dynamic Token Density Control for Diffusion Transformer [31.799640242972373]
Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands.
We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions.
Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed.
arXiv Detail & Related papers (2024-12-08T18:59:16Z) - EdgeNAT: Transformer for Efficient Edge Detection [2.34098299695111]
We propose EdgeNAT, a one-stage transformer-based edge detector with DiNAT as the encoder.
Experiments on multiple datasets show that our method achieves state-of-the-art performance on both RGB and depth images.
arXiv Detail & Related papers (2024-08-20T04:04:22Z) - LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba [54.85262314960038]
Local Attentional Mamba blocks capture both global contexts and local details with linear complexity.
Our model exhibits exceptional scalability and surpasses the performance of DiT across various model scales on ImageNet at 256x256 resolution.
Compared to state-of-the-art diffusion models on ImageNet 256x256 and 512x512, our largest model presents notable advantages, such as a reduction of up to 62% GFLOPs.
arXiv Detail & Related papers (2024-08-05T16:39:39Z) - Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis [82.72941975704374]
Non-autoregressive Transformers (NATs) have been recognized for their rapid generation.
We re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies.
We propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework.
arXiv Detail & Related papers (2024-06-08T13:52:20Z) - Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks.
Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
arXiv Detail & Related papers (2024-04-02T17:58:49Z) - Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z) - PointGPT: Auto-regressively Generative Pre-training from Point Clouds [45.488532108226565]
We present PointGPT, a novel approach that extends the concept of GPT to point clouds.
Specifically, a point cloud auto-regressive generation task is proposed to pre-train transformer models.
Our approach achieves classification accuracies of 94.9% on the ModelNet40 dataset and 93.4% on the ScanObjectNN dataset, outperforming all other transformer models.
arXiv Detail & Related papers (2023-05-19T07:39:04Z) - AdaPoinTr: Diverse Point Cloud Completion with Adaptive Geometry-Aware Transformers [94.11915008006483]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We design a new model, called PoinTr, which adopts a Transformer encoder-decoder architecture for point cloud completion.
Our method attains 6.53 CD on PCN, 0.81 CD on ShapeNet-55 and 0.392 MMD on real-world KITTI.
arXiv Detail & Related papers (2023-01-11T16:14:12Z) - Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy but perceptually insignificant compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z) - Green Hierarchical Vision Transformer for Masked Image Modeling [54.14989750044489]
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs).
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - UniFormer: Unifying Convolution and Self-attention for Visual Recognition [69.68907941116127]
Convolutional neural networks (CNNs) and vision transformers (ViTs) have been two dominant frameworks in the past few years.
We propose a novel Unified transFormer (UniFormer) which seamlessly integrates the merits of convolution and self-attention in a concise transformer format.
Our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification.
arXiv Detail & Related papers (2022-01-24T04:39:39Z) - StyleSwin: Transformer-based GAN for High-resolution Image Generation [28.703687511694305]
We seek to explore using pure transformers to build a generative adversarial network for high-resolution image synthesis.
The proposed generator adopts the Swin Transformer in a style-based architecture.
We show that restoring the absolute position information lost in window-based transformers greatly benefits generation quality.
arXiv Detail & Related papers (2021-12-20T18:59:51Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [99.36226415086243]
We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks.
A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token.
arXiv Detail & Related papers (2021-07-01T17:59:56Z) - Styleformer: Transformer based Generative Adversarial Networks with Style Vector [5.025654873456756]
Styleformer is a style-based generator for a GAN architecture, but one that is convolution-free and transformer-based.
We show how a transformer can generate high-quality images, overcoming the limitation that convolution operations struggle to capture global features in an image.
arXiv Detail & Related papers (2021-06-13T15:30:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.