Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
- URL: http://arxiv.org/abs/2507.01957v1
- Date: Wed, 02 Jul 2025 17:59:23 GMT
- Title: Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation
- Authors: Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han,
- Abstract summary: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation.<n>We reduce the generation steps from 256 to 20 (256$times$256 res.) and 1024 to 48 without compromising quality on the ImageNet class-conditional generation.
- Score: 10.421912048948634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256$\times$256 res.) and 1024 to 48 (512$\times$512 res.) without compromising quality on the ImageNet class-conditional generation, and achieving at least 3.4$\times$ lower latency than previous parallelized autoregressive models.
Related papers
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in Transformer.<n>Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis.<n>In GenAI-benchmark, our 2.7B model achieves 0.77 overall score on hard prompts, outperforming AR models LlamaGen by 0.18 and diffusion models LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z) - Autoregressive Image Generation with Randomized Parallel Decoding [23.714192351237628]
ARPG is a novel visual autoregressive model that enables randomized parallel generation.<n>Our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput.
arXiv Detail & Related papers (2025-03-13T17:19:51Z) - Parallelized Autoregressive Visual Generation [65.9579525736345]
We propose a simple yet effective approach for parallelized autoregressive visual generation.<n>Our method achieves a 3.6x speedup with comparable quality and up to 9.5x speedup with minimal quality degradation across both image and video generation tasks.
arXiv Detail & Related papers (2024-12-19T17:59:54Z) - ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z) - Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [60.188309982690335]
We propose a training-free probabilistic parallel decoding algorithm, Speculative Jacobi Decoding (SJD)<n>SJD accelerates the inference of auto-regressive text-to-image generation while maintaining the randomness in sampling-based token decoding.<n>Specifically, SJD facilitates the model to predict multiple tokens at each step and accepts tokens based on the probabilistic criterion.
arXiv Detail & Related papers (2024-10-02T16:05:27Z) - Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z) - Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding [111.16221796950126]
We propose Lformer, a semi-autoregressive text-to-image generation model.
By leveraging the 2D structure of image tokens, Lformer achieves faster speed than the existing transformer-based methods.
Lformer can edit images without the requirement for finetuning.
arXiv Detail & Related papers (2023-03-07T11:10:22Z) - Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method takes effect by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.