DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- URL: http://arxiv.org/abs/2505.21473v1
- Date: Tue, 27 May 2025 17:45:21 GMT
- Title: DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction
- Authors: Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K. Du, Shu Cheng, Zehuan Yuan, Xinglong Wu
- Abstract summary: This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure. Our method achieves high-quality image synthesis with significantly fewer tokens than previous approaches.
- Score: 47.483590046908844
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches such as VAR/VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation speed by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), both of which require 680 tokens in their AR models. Moreover, owing to the significantly reduced token count and the parallel inference mechanism, our method runs nearly 2x faster than VAR and FlexVAR at inference. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
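To make the next-detail objective concrete, here is a minimal sketch of the coarse-to-fine supervision the abstract describes, where shorter 1D token prefixes are trained to reconstruct more strongly degraded (lower-resolution) targets. The `encoder`/`decoder` callables and the prefix schedule are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of next-detail supervision: the first k tokens of the
# 1D sequence are trained to reconstruct a strongly degraded image, and each
# longer prefix is held to a higher-resolution target.
import torch
import torch.nn.functional as F

def next_detail_loss(encoder, decoder, image, prefix_schedule=(16, 32, 64, 128)):
    """image: (B, 3, H, W); prefix_schedule: token prefix lengths, coarse to fine."""
    tokens = encoder(image)                       # (B, N, D) 1D latent tokens
    n = tokens.shape[1]
    loss = 0.0
    for k in prefix_schedule:
        # Linear resolution grows roughly with the square root of the token
        # budget, so a prefix of k of N tokens gets a downscaled target.
        frac = (k / n) ** 0.5
        size = (max(1, int(image.shape[2] * frac)), max(1, int(image.shape[3] * frac)))
        target = F.interpolate(image, size=size, mode="bilinear", align_corners=False)
        recon = decoder(tokens[:, :k])            # decode only the prefix
        recon = F.interpolate(recon, size=size, mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(recon, target)
    return loss / len(prefix_schedule)
```

The reported 8x speedup comes from the separate parallel inference mechanism with self-correction, which this sketch does not cover.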
Related papers
- HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation [91.08481618973111]
Visual Auto-Regressive modeling (VAR) has shown promise in bridging the speed and quality gap between autoregressive image models and diffusion models. We introduce Hierarchical Masked Auto-Regressive modeling (HMAR) to generate high-quality images with fast sampling. HMAR reformulates next-scale prediction as a Markovian process, wherein the prediction of each resolution scale is conditioned only on tokens in its immediate predecessor.
arXiv Detail & Related papers (2025-06-04T20:08:07Z)
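The Markovian reformulation can be illustrated with a short sketch: each scale is generated from its immediate predecessor only, rather than from all previous scales as in vanilla VAR. `predictor` and its interface are hypothetical, shown only to make the conditioning structure explicit.

```python
# Illustrative Markovian next-scale loop: each scale conditions ONLY on the
# tokens of its immediate predecessor, so earlier scales can be freed after
# use. `predictor` is a hypothetical module, not HMAR's released code.
import torch

@torch.no_grad()
def markovian_next_scale(predictor, class_emb, scales=(1, 2, 4, 8, 16)):
    prev = class_emb.unsqueeze(1)                 # (B, 1, D) conditioning token
    for s in scales:
        logits = predictor(prev, target_hw=s)     # (B, s*s, V), sees only `prev`
        ids = logits.argmax(dim=-1)               # greedy choice for brevity
        prev = predictor.embed(ids)               # (B, s*s, D) input to next step
    return prev                                   # finest-scale token embeddings
```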
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text-encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
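The abstract says only that Token-Shuffle reduces the token count; a plausible pixel-shuffle-style realization is sketched below, folding each r x r window of tokens into the channel dimension and restoring it afterwards. This is an assumption about the operator, not the paper's exact method.

```python
# One plausible pixel-shuffle-style token merge (an assumption, not the
# paper's exact operator): fold each r x r window of tokens into the channel
# dimension before the Transformer, then unfold afterwards.
import torch

def token_shuffle(x, h, w, r=2):
    """x: (B, h*w, D) -> (B, h*w/r^2, D*r^2) by merging r x r token windows."""
    b, _, d = x.shape
    x = x.reshape(b, h // r, r, w // r, r, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // r) * (w // r), d * r * r)

def token_unshuffle(x, h, w, r=2):
    """Inverse: (B, h*w/r^2, D*r^2) -> (B, h*w, D)."""
    b, _, dr = x.shape
    d = dr // (r * r)
    x = x.reshape(b, h // r, w // r, r, r, d)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(b, h * w, d)

x = torch.randn(1, 16 * 16, 64)
assert torch.equal(token_unshuffle(token_shuffle(x, 16, 16), 16, 16), x)
```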
- Autoregressive Image Generation with Randomized Parallel Decoding [23.714192351237628]
ARPG is a novel visual autoregressive model that enables randomized parallel generation. Our approach attains an FID of 1.94 with only 64 sampling steps, achieving over a 20-fold increase in throughput.
arXiv Detail & Related papers (2025-03-13T17:19:51Z)
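A hedged sketch of the randomized parallel idea: token positions are visited in a random order and filled several at a time, so 256 tokens can be produced in 64 steps. The `model` interface (scoring a set of query positions given the partially filled sequence) is a hypothetical stand-in for ARPG's architecture.

```python
# Hedged sketch of randomized parallel decoding: visit positions in a random
# order and fill several per step. `model(tokens, query_positions=...)` is a
# hypothetical interface standing in for ARPG's guided decoding.
import torch

@torch.no_grad()
def random_parallel_decode(model, num_tokens=256, steps=64, vocab_size=4096):
    order = torch.randperm(num_tokens)                 # random generation order
    tokens = torch.full((1, num_tokens), -1, dtype=torch.long)  # -1 = unknown
    per_step = num_tokens // steps
    for s in range(steps):
        pos = order[s * per_step:(s + 1) * per_step]
        logits = model(tokens, query_positions=pos)    # (1, per_step, vocab_size)
        probs = torch.softmax(logits[0], dim=-1)
        tokens[0, pos] = torch.multinomial(probs, 1).squeeze(-1)
    return tokens
```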
- NFIG: Autoregressive Image Generation with Next-Frequency Prediction [50.69346038028673]
We present Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. Our approach first generates low-frequency components to establish global structure with fewer tokens, then progressively adds higher-frequency details, following the natural spectral hierarchy of images.
arXiv Detail & Related papers (2025-03-10T08:59:10Z)
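The frequency decomposition NFIG builds on can be illustrated with a standard Fourier low-pass split: the low band carries the global structure targeted by early stages, and the residual high band holds the detail added later. The cutoff value here is illustrative, not the paper's schedule.

```python
# Standard Fourier low-pass split illustrating the decomposition: `low`
# carries global structure for early stages; `high` is the residual detail
# added by later stages.
import torch

def frequency_bands(image, cutoff=0.125):
    """image: (B, C, H, W) -> (low, high) with low + high == image."""
    spec = torch.fft.fftshift(torch.fft.fft2(image), dim=(-2, -1))
    h, w = image.shape[-2:]
    yy = (torch.arange(h).view(-1, 1) - h // 2) / h
    xx = (torch.arange(w).view(1, -1) - w // 2) / w
    mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).to(spec.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real
    return low, image - low
```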
- FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction [91.09318592542509]
This work challenges the residual prediction paradigm in visual autoregressive modeling. It presents a new Flexible Visual AutoRegressive image generation paradigm. This simple, intuitive approach swiftly learns visual distributions and makes the generation process more flexible and adaptable.
arXiv Detail & Related papers (2025-02-27T17:39:17Z)
- ARFlow: Autoregressive Flow with Hybrid Linear Attention [48.707933347079894]
Flow models are effective at progressively generating realistic images. However, they struggle to capture long-range dependencies during the generation process. We propose integrating autoregressive modeling into flow models.
arXiv Detail & Related papers (2025-01-27T14:33:27Z)
- Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive (VAR) framework. CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales. CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z)
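The collaborative split can be sketched in a few lines: a large model drafts the early, small scales (few tokens, high difficulty) and a small model takes over the later, large scales where most tokens live. Both models, the `generate_scale` interface, and the handoff point are hypothetical placeholders.

```python
# Sketch of the collaborative split: the large model drafts the early, small
# scales; the small model refines the later, large scales where most tokens
# live. Models, `generate_scale`, and the handoff index are placeholders.
import torch

@torch.no_grad()
def collaborative_decode(large_model, small_model, cond,
                         scales=(1, 2, 4, 8, 16), handoff=3):
    tokens_per_scale = []
    for i, s in enumerate(scales):
        model = large_model if i < handoff else small_model
        tokens_per_scale.append(model.generate_scale(tokens_per_scale, cond, hw=s))
    return tokens_per_scale[-1]                    # finest-scale tokens
```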
- Randomized Autoregressive Visual Generation [26.195148077398223]
This paper presents Randomized AutoRegressive modeling (RAR) for visual generation.
RAR sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks.
On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods.
arXiv Detail & Related papers (2024-11-01T17:59:58Z)
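A hedged sketch of randomized-order autoregressive training, which keeps the plain next-token objective and hence full compatibility with language-modeling frameworks: each sequence is permuted before training, and the permutation probability can be annealed toward raster order. The `model` signature and annealing handling are simplified assumptions.

```python
# Simplified randomized-order AR training: permute the sequence, then apply
# the ordinary next-token objective over the permuted order. The `positions`
# argument (telling the model the visit order) is a hypothetical interface.
import torch
import torch.nn.functional as F

def randomized_ar_loss(model, tokens, perm_prob=1.0):
    """tokens: (B, N) discrete image tokens; anneal perm_prob -> 0 late in training."""
    n = tokens.shape[1]
    order = torch.randperm(n) if torch.rand(()) < perm_prob else torch.arange(n)
    shuffled = tokens[:, order]
    logits = model(shuffled[:, :-1], positions=order)   # (B, N-1, V)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           shuffled[:, 1:].reshape(-1))
```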
- Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization. Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256x256 and 2.89 on ImageNet 512x512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
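Time-dependent layer normalization as described admits a compact adaLN-style sketch: the normalization's scale and shift are predicted from the diffusion timestep embedding rather than being fixed learned parameters. Layer shapes here are illustrative.

```python
# Compact adaLN-style sketch: LayerNorm scale and shift are predicted from
# the timestep embedding instead of being fixed learned parameters.
import torch
import torch.nn as nn

class TimeDependentLayerNorm(nn.Module):
    def __init__(self, dim, time_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(time_dim, 2 * dim)

    def forward(self, x, t_emb):
        """x: (B, N, dim); t_emb: (B, time_dim) diffusion timestep embedding."""
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```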
- Progressive Text-to-Image Generation [40.09326229583334]
We present a progressive model for high-fidelity text-to-image generation.
The proposed method works by creating new image tokens from coarse to fine based on the existing context.
The resulting coarse-to-fine hierarchy makes the image generation process intuitive and interpretable.
arXiv Detail & Related papers (2022-10-05T14:27:20Z)