Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective
- URL: http://arxiv.org/abs/2507.01652v1
- Date: Wed, 02 Jul 2025 12:27:06 GMT
- Title: Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective
- Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai,
- Abstract summary: We present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity.<n>Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency.
- Score: 47.87649021414188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention's efficiency and spatial understanding needed for high-quality generation.
Related papers
- LINA: Linear Autoregressive Image Generative Models with Continuous Tokens [56.80443965097921]
Autoregressive models with continuous tokens form a promising paradigm for visual generation, especially for text-to-image (T2I) synthesis.<n>We study how to design compute-efficient linear attention within this framework.<n>We present LINA, a simple and compute-efficient T2I model built entirely on linear attention, capable of generating high-fidelity 1024x1024 images from user instructions.
arXiv Detail & Related papers (2026-01-30T06:44:33Z) - One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation [33.56782043207013]
Feature Auto-Encoder (FAE) adapts pre-trained visual representations into low-dimensional latents suitable for generation using as little as a single attention layer.<n>FAE achieves strong performance across class-conditional and text-to-image benchmarks.
arXiv Detail & Related papers (2025-12-08T18:57:26Z) - Learning Spatial Decay for Vision Transformers [50.63391799053993]
Vision Transformers (ViTs) have revolutionized computer vision, yet their self-attention mechanism lacks explicit spatial inductive biases.<n>Existing approaches introduce data-independent spatial decay based on fixed distance metrics.<n>We present the first successful adaptation of data-dependent spatial decay to 2D vision transformers.
arXiv Detail & Related papers (2025-08-13T06:18:32Z) - CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning [0.0]
We introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing.<n>CSAT exploits the inherent compressibility of both visual and textual representations especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse.
arXiv Detail & Related papers (2025-06-30T02:11:20Z) - Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation [8.624395048491275]
We propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA)<n>ADSA identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention.<n>We also introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately $50%$.
arXiv Detail & Related papers (2025-06-23T01:27:06Z) - Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling.<n>We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution.<n>Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z) - Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism for optimized vision tasks that inherently captures 2D spatial structures.<n>GSPN overcomes limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach.<n>GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z) - Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation.<n>We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations.<n>We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z) - Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction [15.537910100051866]
We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI)
We propose Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN)
Our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.
arXiv Detail & Related papers (2024-06-18T15:15:12Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - ESSAformer: Efficient Transformer for Hyperspectral Image
Super-resolution [76.7408734079706]
Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation.
We propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure.
arXiv Detail & Related papers (2023-07-26T07:45:14Z) - DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z) - Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
arXiv Detail & Related papers (2021-10-06T02:00:38Z) - Grid Partitioned Attention: Efficient TransformerApproximation with
Inductive Bias for High Resolution Detail Generation [3.4373727078460665]
We present Grid Partitioned Attention (GPA), a new approximate attention algorithm.
Our paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned.
Our contributions are (i) algorithm and code1of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the art experimental results in human pose morphing generation benchmarks.
arXiv Detail & Related papers (2021-07-08T10:37:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.