From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation
- URL: http://arxiv.org/abs/2512.24639v1
- Date: Wed, 31 Dec 2025 05:24:07 GMT
- Title: From Sequential to Spatial: Reordering Autoregression for Efficient Visual Generation
- Authors: Siyang Wang, Hanting Li, Wei Li, Jie Hu, Xinghao Chen, Feng Zhao
- Abstract summary: We propose RadAR, an efficient and parallelizable framework to accelerate autoregressive visual generation. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors--a property not fully exploited in standard raster-scan decoding orders.
- Score: 26.867135297190064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the remarkable success of autoregressive models in language modeling, this paradigm has been widely adopted in visual generation. However, the sequential token-by-token decoding mechanism inherent in traditional autoregressive models leads to low inference efficiency. In this paper, we propose RadAR, an efficient and parallelizable framework designed to accelerate autoregressive visual generation while preserving its representational capacity. Our approach is motivated by the observation that visual tokens exhibit strong local dependencies and spatial correlations with their neighbors--a property not fully exploited in standard raster-scan decoding orders. Specifically, we organize the generation process around a radial topology: an initial token is selected as the starting point, and all other tokens are systematically grouped into multiple concentric rings according to their spatial distances from this center. Generation then proceeds in a ring-wise manner, from inner to outer regions, enabling the parallel prediction of all tokens within the same ring. This design not only preserves the structural locality and spatial coherence of visual scenes but also substantially increases parallelization. Furthermore, to address the risk of inconsistent predictions arising from simultaneous token generation with limited context, we introduce a nested attention mechanism. This mechanism dynamically refines implausible outputs during the forward pass, thereby mitigating error accumulation and preventing model collapse. By integrating radial parallel prediction with dynamic output correction, RadAR significantly improves generation efficiency.
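The ring-wise ordering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `radial_rings`, the default choice of the grid midpoint as the starting token, and the use of Chebyshev distance to define rings are all assumptions made for illustration, since the abstract only says that tokens are grouped "according to their spatial distances" from the starting point.

```python
from collections import defaultdict


def radial_rings(height, width, center=None):
    """Group token positions of a height x width grid into concentric rings.

    Assumption: the ring index is the Chebyshev distance to `center`,
    which yields square rings. The source does not specify the metric.
    """
    if center is None:
        # Assumption: start from the grid midpoint.
        center = (height // 2, width // 2)
    cr, cc = center

    rings = defaultdict(list)
    for r in range(height):
        for c in range(width):
            ring_index = max(abs(r - cr), abs(c - cc))
            rings[ring_index].append((r, c))

    # Inner-to-outer order; all tokens in one ring would be predicted in parallel.
    return [rings[d] for d in sorted(rings)]


rings = radial_rings(16, 16)
print(len(rings), "ring-wise steps versus", 16 * 16, "raster-scan steps")
```

Under these assumptions, a 16 x 16 token grid is covered in 9 ring-wise steps rather than 256 sequential ones, which shows where the parallelism comes from. The number of rings (and hence the speedup) depends on the distance metric and the starting position.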
Related papers
- Learning to Expand Images for Efficient Visual Autoregressive Modeling [26.400433163290586]
We introduce Expanding Autoregressive Representation (EAR), a novel generation paradigm that emulates the human visual system's center-outward perception pattern. EAR unfolds image tokens in a spiral order from the center and progressively expands outward, preserving spatial continuity and enabling efficient parallel decoding.
arXiv Detail & Related papers (2025-11-19T14:55:07Z)
- IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction [77.06211178777939]
IAR2 is an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. We show that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving an FID of 1.50 on ImageNet.
arXiv Detail & Related papers (2025-10-08T12:08:21Z)
- REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z)
- Multi-scale Autoregressive Models are Laplacian, Discrete, and Latent Diffusion Models in Disguise [0.6875312133832079]
We revisit Visual Autoregressive models through the lens of an iterative-refinement framework. We formalise it as a deterministic forward process that constructs a Laplacian-style latent pyramid, paired with a learned backward process that reconstructs it in a small number of coarse-to-fine steps.
arXiv Detail & Related papers (2025-10-03T09:05:38Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [85.82112629564942]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Autoregressive Image Generation with Randomized Parallel Decoding [28.352741116124538]
We introduce ARPG, a novel visual autoregressive model that enables randomized parallel generation. ARPG achieves over a 30 times speedup in inference and a 75 percent reduction in memory consumption. On the ImageNet-1K 256 benchmark, our approach attains an FID of 1.83 with only 32 sampling steps.
arXiv Detail & Related papers (2025-03-13T17:19:51Z)
- Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation. Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder. The model achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.