D-AR: Diffusion via Autoregressive Models
- URL: http://arxiv.org/abs/2505.23660v1
- Date: Thu, 29 May 2025 17:09:25 GMT
- Title: D-AR: Diffusion via Autoregressive Models
- Authors: Ziteng Gao, Mike Zheng Shou
- Abstract summary: Diffusion via Autoregressive models (D-AR) is a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure. Our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens.
- Score: 21.03363985989625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents Diffusion via Autoregressive models (D-AR), a new paradigm recasting the image diffusion process as a vanilla autoregressive procedure in the standard next-token-prediction fashion. We start by designing a tokenizer that converts images into sequences of discrete tokens, where tokens at different positions can be decoded into different diffusion denoising steps in the pixel space. Thanks to the diffusion properties, these tokens naturally follow a coarse-to-fine order, which directly lends itself to autoregressive modeling. We therefore apply standard next-token prediction on these tokens, without modifying any underlying designs (causal masks or training/inference strategies), and such sequential autoregressive token generation directly mirrors the diffusion procedure in image space. That is, once the autoregressive model generates an increment of tokens, we can directly decode these tokens into the corresponding diffusion denoising step in a streaming manner. Our pipeline naturally reveals several intriguing properties; for example, it supports consistent previews when generating only a subset of tokens and enables zero-shot layout-controlled synthesis. On the standard ImageNet benchmark, our method achieves 2.09 FID using a 775M Llama backbone with 256 discrete tokens. We hope our work can inspire future research on unified autoregressive architectures for visual synthesis, especially with large language models. Code and models will be available at https://github.com/showlab/D-AR
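The streaming behavior described in the abstract is easy to picture in code. The sketch below is a hypothetical reading of the pipeline, not the released implementation: `next_token_fn`, `denoise_step_fn`, the group size, and the toy stand-ins are all assumptions for illustration.

```python
import torch

@torch.no_grad()
def generate_streaming(next_token_fn, denoise_step_fn,
                       num_tokens=256, group_size=16):
    """Sketch of D-AR-style streaming generation. The AR model emits
    discrete tokens in coarse-to-fine order; each completed group of
    tokens is decoded into one diffusion denoising step in pixel space,
    so a consistent preview is available throughout generation."""
    tokens, image = [], None
    for _ in range(num_tokens // group_size):
        for _ in range(group_size):
            tokens.append(next_token_fn(tokens))  # plain next-token prediction
        image = denoise_step_fn(image, tokens)    # token increment -> denoising step
        yield image                               # streamed preview

# Toy stand-ins so the sketch runs end to end (outputs are not meaningful):
dummy_next = lambda toks: int(torch.randint(1024, (1,)))
dummy_denoise = lambda img, toks: torch.randn(3, 256, 256)
for preview in generate_streaming(dummy_next, dummy_denoise):
    pass  # inspect or display each preview here
```

Because every group of tokens maps to a concrete denoising step, aborting the loop early still leaves a coherent (just less refined) image, which is what makes the consistent-preview property fall out for free.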
Related papers
- AliTok: Towards Sequence Modeling Alignment between Tokenizer and Autoregressive Model [59.065471969232284]
We propose a novel Aligned Tokenizer (AliTok) to align the tokenizer and autoregressive model. On the ImageNet-256 benchmark, using a standard decoder-only autoregressive model as the generator, AliTok achieves a gFID score of 1.50 and an IS of 305.9. When the parameter count is increased to 662M, AliTok achieves a gFID score of 1.35, surpassing the state-of-the-art diffusion method with 10x faster sampling speed.
arXiv Detail & Related papers (2025-06-05T17:45:10Z)
- DiSA: Diffusion Step Annealing in Autoregressive Image Generation [35.35184094233562]
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation, at the cost of many denoising steps per token. As more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample; this paper explores how to exploit that observation to reduce the sampling cost.
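As a rough illustration of the annealing idea, a schedule might shrink the per-token diffusion sampling budget as the sequence fills in; the linear schedule and step counts below are assumptions, not the paper's actual schedule.

```python
def annealed_diffusion_steps(token_index, num_tokens, max_steps=50, min_steps=5):
    """Hypothetical linear annealing: early tokens get the full diffusion
    sampling budget, while later tokens, whose distributions are already
    constrained by the generated prefix, get fewer denoising steps."""
    frac = token_index / max(num_tokens - 1, 1)
    return round(max_steps - frac * (max_steps - min_steps))

# e.g. token 0 of 256 -> 50 steps; token 255 -> 5 steps
```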
arXiv Detail & Related papers (2025-05-26T17:59:57Z)
- Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots [103.48424042986271]
We introduce a new autoregressive design to model a hierarchy from a few low-resolution image tokens to the typical dense image tokens. We present Hierarchical Masked Autoregressive models (Hi-MAR), which pivot on low-resolution image tokens to trigger hierarchical autoregressive modeling.
arXiv Detail & Related papers (2025-05-26T17:59:07Z)
- Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models [92.18057318458528]
Token-Shuffle is a novel method that reduces the number of image tokens in the Transformer. Our strategy requires no additional pretrained text encoder and enables MLLMs to support extremely high-resolution image synthesis. On the GenAI benchmark, our 2.7B model achieves a 0.77 overall score on hard prompts, outperforming the AR model LlamaGen by 0.18 and the diffusion model LDM by 0.15.
arXiv Detail & Related papers (2025-04-24T17:59:56Z)
- Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation [63.89280381800457]
We propose TokenBridge, which maintains the strong representation capacity of continuous tokens while preserving the modeling simplicity of discrete tokens. We introduce a dimension-wise quantization strategy that independently discretizes each feature dimension, paired with a lightweight autoregressive prediction mechanism. Our approach achieves reconstruction and generation quality on par with continuous methods while using standard categorical prediction.
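A minimal sketch of what dimension-wise quantization could look like; the uniform grid, bin count, and value range here are illustrative assumptions rather than TokenBridge's actual scheme.

```python
import torch

def dimensionwise_quantize(z, num_bins=64, lo=-1.0, hi=1.0):
    """Discretize each feature dimension independently on a uniform grid,
    so a d-dim continuous token becomes d small categorical indices
    instead of one index into a huge joint codebook."""
    return ((z.clamp(lo, hi) - lo) / (hi - lo) * (num_bins - 1)).round().long()

def dimensionwise_dequantize(idx, num_bins=64, lo=-1.0, hi=1.0):
    return lo + idx.float() / (num_bins - 1) * (hi - lo)

# Round trip: z_hat matches z to within half a bin width per dimension.
z = torch.randn(4, 16).clamp(-1, 1)
z_hat = dimensionwise_dequantize(dimensionwise_quantize(z))
```

The payoff is combinatorial: d independent 64-way predictions cover a space of 64^d joint codes without ever materializing a codebook that large.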
arXiv Detail & Related papers (2025-03-20T17:59:59Z)
- Improving Autoregressive Image Generation through Coarse-to-Fine Token Prediction [4.900334213807624]
We show how to enjoy the benefits of large codebooks without making autoregressive modeling more difficult. Our framework consists of two stages: (1) an autoregressive model that sequentially predicts coarse labels for each token in the sequence, and (2) an auxiliary model that simultaneously predicts fine-grained labels for all tokens conditioned on their coarse labels.
arXiv Detail & Related papers (2025-03-20T14:41:29Z)
- Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
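The training recipe reads naturally as code. This is a simplified sketch under assumed shapes and a linear noising scheme, not the paper's exact formulation; `model` stands in for any causal sequence denoiser conditioned on per-token noise levels.

```python
import torch

def diffusion_forcing_loss(model, x):
    """Sketch: sample an independent noise level per token, noise the
    whole sequence accordingly, and train a causal model to recover the
    injected noise at every position."""
    B, T, D = x.shape
    t = torch.rand(B, T, 1)                 # independent per-token noise levels
    noise = torch.randn_like(x)
    x_noisy = (1.0 - t) * x + t * noise     # simple linear interpolation noising
    pred = model(x_noisy, t)                # causal model sees noisy tokens + levels
    return torch.mean((pred - noise) ** 2)  # predict the injected noise

# Toy stand-in so the sketch runs; a real model would be a causal Transformer.
toy_model = lambda x_noisy, t: torch.zeros_like(x_noisy)
loss = diffusion_forcing_loss(toy_model, torch.randn(2, 8, 4))
```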
arXiv Detail & Related papers (2024-07-01T15:43:25Z)
- Autoregressive Image Generation without Vector Quantization [31.798754606008067]
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens.
We propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space.
arXiv Detail & Related papers (2024-06-17T17:59:58Z)
- AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation [138.98095392584693]
We introduce Auto-Regressive Diffusion (AR-Diffusion) to account for the inherent sequential characteristic of natural language.
AR-Diffusion ensures that the generation of tokens on the right depends on the generated ones on the left, a mechanism achieved through employing a dynamic number of denoising steps.
In a series of experiments on various text generation tasks, AR-Diffusion clearly demonstrated its superiority over existing diffusion language models.
arXiv Detail & Related papers (2023-05-16T15:10:22Z)