Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective
- URL: http://arxiv.org/abs/2507.01652v1
- Date: Wed, 02 Jul 2025 12:27:06 GMT
- Title: Autoregressive Image Generation with Linear Complexity: A Spatial-Aware Decay Perspective
- Authors: Yuxin Mao, Zhen Qin, Jinxing Zhou, Hui Deng, Xuyang Shen, Bin Fan, Jing Zhang, Yiran Zhong, Yuchao Dai
- Abstract summary: We present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency.
- Score: 47.87649021414188
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive (AR) models have garnered significant attention in image generation for their ability to effectively capture both local and global structures within visual data. However, prevalent AR models predominantly rely on the transformer architectures, which are beset by quadratic computational complexity concerning input sequence length and substantial memory overhead due to the necessity of maintaining key-value caches. Although linear attention mechanisms have successfully reduced this burden in language models, our initial experiments reveal that they significantly degrade image generation quality because of their inability to capture critical long-range dependencies in visual data. We propose Linear Attention with Spatial-Aware Decay (LASAD), a novel attention mechanism that explicitly preserves genuine 2D spatial relationships within the flattened image sequences by computing position-dependent decay factors based on true 2D spatial location rather than 1D sequence positions. Based on this mechanism, we present LASADGen, an autoregressive image generator that enables selective attention to relevant spatial contexts with linear complexity. Experiments on ImageNet show LASADGen achieves state-of-the-art image generation performance and computational efficiency, bridging the gap between linear attention's efficiency and spatial understanding needed for high-quality generation.
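The abstract describes the core idea only at a high level: decay factors are computed from true 2D spatial locations of tokens in the flattened raster sequence, rather than from 1D sequence positions. The following sketch illustrates that weighting in a minimal, quadratic reference form; the function name, the scalar decay base `lam`, and the use of Manhattan distance are assumptions for illustration, and the paper's actual linear-complexity recurrence is not reproduced here.

```python
import numpy as np

def spatial_decay_attention(q, k, v, grid_w, lam=0.95):
    """Illustrative (quadratic) reference for spatial-aware decay.

    Tokens are a raster-flattened grid of width grid_w. The weight
    between query position i and key position j decays as
    lam ** d(i, j), where d is the 2D Manhattan distance between
    their grid coordinates -- not the 1D gap (i - j) used by
    conventional sequence-position decays.
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):  # causal: token i attends only to j <= i
        yi, xi = divmod(i, grid_w)
        acc = np.zeros((d, v.shape[1]))
        for j in range(i + 1):
            yj, xj = divmod(j, grid_w)
            decay = lam ** (abs(yi - yj) + abs(xi - xj))
            acc += decay * np.outer(k[j], v[j])
        out[i] = q[i] @ acc
    return out
```

Note the contrast this sketch makes concrete: in a 16-wide grid, the token directly above position i is 16 steps away in the 1D sequence but only 1 step away in 2D, so a 2D-aware decay preserves the vertical neighbor's influence that a 1D decay would suppress.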
Related papers
- CS-VLM: Compressed Sensing Attention for Efficient Vision-Language Representation Learning [0.0]
We introduce the Compressed Sensing Attention Transformer (CSAT), a novel architecture that reimagines attention computation through the lens of compressed sensing. CSAT exploits the inherent compressibility of both visual and textual representations, which is especially evident in video, where temporal redundancy is high, and in language, where cross-modal grounding is often sparse.
arXiv Detail & Related papers (2025-06-30T02:11:20Z)
- Make It Efficient: Dynamic Sparse Attention for Autoregressive Image Generation [8.624395048491275]
We propose a novel training-free context optimization method called Adaptive Dynamic Sparse Attention (ADSA). ADSA identifies historical tokens crucial for maintaining local texture consistency and those essential for ensuring global semantic coherence, thereby efficiently streamlining attention. We also introduce a dynamic KV-cache update mechanism tailored for ADSA, reducing GPU memory consumption during inference by approximately 50%.
arXiv Detail & Related papers (2025-06-23T01:27:06Z)
- Breaking Complexity Barriers: High-Resolution Image Restoration with Rank Enhanced Linear Attention [54.42902794496325]
Linear attention, a variant of softmax attention, demonstrates promise in global context modeling. We propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer.
arXiv Detail & Related papers (2025-05-22T02:57:23Z)
- Parallel Sequence Modeling via Generalized Spatial Propagation Network [80.66202109995726]
Generalized Spatial Propagation Network (GSPN) is a new attention mechanism, optimized for vision tasks, that inherently captures 2D spatial structures. GSPN overcomes the limitations of existing methods by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation.
arXiv Detail & Related papers (2025-01-21T18:56:19Z)
- Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction [15.537910100051866]
We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI).
We propose the Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN).
Our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.
arXiv Detail & Related papers (2024-06-18T15:15:12Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution [76.7408734079706]
Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation.
We propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure.
arXiv Detail & Related papers (2023-07-26T07:45:14Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit, dedicated to covering the variation on the optimal number of tokens one position should focus on.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
arXiv Detail & Related papers (2021-10-06T02:00:38Z)
- Grid Partitioned Attention: Efficient Transformer Approximation with Inductive Bias for High Resolution Detail Generation [3.4373727078460665]
We present Grid Partitioned Attention (GPA), a new approximate attention algorithm.
Our paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned.
Our contributions are (i) the algorithm and code of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the-art experimental results on human pose morphing generation benchmarks.
arXiv Detail & Related papers (2021-07-08T10:37:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.