Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation
- URL: http://arxiv.org/abs/2510.21003v1
- Date: Thu, 23 Oct 2025 21:21:38 GMT
- Title: Distilled Decoding 2: One-step Sampling of Image Auto-regressive Models with Conditional Score Distillation
- Authors: Enshu Liu, Qian Chen, Xuefei Ning, Shengen Yan, Guohao Dai, Zinan Lin, Yu Wang
- Abstract summary: Image Auto-regressive (AR) models suffer from slow generation speed due to the large number of sampling steps required. We propose Distilled Decoding 2 (DD2) to further advance the feasibility of one-step sampling for image AR models. Compared to the strongest baseline DD1, DD2 reduces the gap between one-step sampling and the original AR model by 67%, with up to 12.3$\times$ training speed-up simultaneously.
- Score: 34.82072097985874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image Auto-regressive (AR) models have emerged as a powerful paradigm of visual generative models. Despite their promising performance, they suffer from slow generation speed due to the large number of sampling steps required. Although Distilled Decoding 1 (DD1) was recently proposed to enable few-step sampling for image AR models, it still incurs significant performance degradation in the one-step setting, and relies on a pre-defined mapping that limits its flexibility. In this work, we propose a new method, Distilled Decoding 2 (DD2), to further advance the feasibility of one-step sampling for image AR models. Unlike DD1, DD2 does not rely on a pre-defined mapping. We view the original AR model as a teacher model which provides the ground truth conditional score in the latent embedding space at each token position. Based on this, we propose a novel conditional score distillation loss to train a one-step generator. Specifically, we train a separate network to predict the conditional score of the generated distribution and apply score distillation at every token position conditioned on previous tokens. Experimental results show that DD2 enables one-step sampling for image AR models with a minimal FID increase from 3.40 to 5.43 on ImageNet-256. Compared to the strongest baseline DD1, DD2 reduces the gap between one-step sampling and the original AR model by 67%, with up to 12.3$\times$ training speed-up simultaneously. DD2 takes a significant step toward the goal of one-step AR generation, opening up new possibilities for fast and high-quality AR modeling. Code is available at https://github.com/imagination-research/Distilled-Decoding-2.
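The score-distillation idea described in the abstract can be illustrated with a toy sketch. This is not the DD2 implementation: it assumes a single scalar "token", a Gaussian teacher whose conditional mean depends linearly on a scalar prefix, and a closed-form score for the generator's own distribution in place of the separately trained score network. All names (`teacher_score`, `fake_score`, `distill`, `TEACHER_A`, `TEACHER_T`) are hypothetical. The update uses the standard reverse-KL score-distillation gradient, E[(s_fake - s_teacher) * dx/dtheta].

```python
import random

# Hypothetical toy teacher: p(x | prefix) = N(TEACHER_A * prefix, TEACHER_T^2).
TEACHER_A = 2.0
TEACHER_T = 0.5


def teacher_score(x, prefix):
    """Conditional score d/dx log p(x | prefix) of the toy teacher."""
    return -(x - TEACHER_A * prefix) / (TEACHER_T ** 2)


def fake_score(x, prefix, w, s):
    """Conditional score of the generator's output distribution N(w * prefix, s^2).

    Closed form in this toy; the abstract describes a separate network
    trained to predict this quantity.
    """
    return -(x - w * prefix) / (s ** 2)


def distill(steps=500, batch=128, lr=0.05, seed=0):
    """Fit generator x = w * prefix + s * z by following the score difference."""
    rng = random.Random(seed)
    w, s = 0.0, 1.5  # initial generator parameters (arbitrary assumption)
    for _ in range(steps):
        grad_w = 0.0
        grad_s = 0.0
        for _ in range(batch):
            prefix = rng.gauss(0.0, 1.0)  # stands in for previous tokens
            z = rng.gauss(0.0, 1.0)       # noise fed to the one-step generator
            x = w * prefix + s * z        # one-step sample
            diff = fake_score(x, prefix, w, s) - teacher_score(x, prefix)
            grad_w += diff * prefix       # dx/dw = prefix
            grad_s += diff * z            # dx/ds = z
        w -= lr * grad_w / batch
        s = max(s - lr * grad_s / batch, 1e-3)  # keep the scale positive
    return w, s


if __name__ == "__main__":
    w, s = distill()
    print(f"learned w={w:.4f}, s={s:.4f}")
```

In this toy the generator parameters move toward the teacher's conditional mean slope and standard deviation, because the score difference vanishes only when the two conditional distributions coincide. In an AR setting the same comparison would be made at every token position, conditioned on the preceding tokens.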
Related papers
- DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction [47.483590046908844]
This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure. Our method achieves high-quality image synthesis with significantly fewer tokens than previous approaches.
arXiv Detail & Related papers (2025-05-27T17:45:21Z)
- Autoregressive Distillation of Diffusion Transformers [18.19070958829772]
We propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256.
arXiv Detail & Related papers (2025-04-15T15:33:49Z)
- FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning [66.5214586624095]
Existing Visual Autoregressive (VAR) paradigms process the entire token map at each scale step, causing complexity and runtime to scale dramatically with image resolution. We propose FastVAR, a post-training acceleration method for efficient resolution scaling with VARs. Experiments show FastVAR can further speed up FlashAttention-accelerated VAR by 2.7$\times$ with a negligible performance drop of 1%.
arXiv Detail & Related papers (2025-03-30T08:51:19Z)
- Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching [12.985270202599814]
Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? We propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model.
arXiv Detail & Related papers (2024-12-22T20:21:54Z)
- Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step [64.53013367995325]
We introduce SiDA (SiD with Adversarial Loss), which improves generation quality and distillation efficiency. SiDA incorporates real images and adversarial loss, allowing it to distinguish between real images and those generated by SiD. SiDA converges significantly faster than its predecessor when distilled from scratch.
arXiv Detail & Related papers (2024-10-19T00:33:51Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. We empirically find that this training paradigm limits the one-step generation performance of consistency models. We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- Improved Distribution Matching Distillation for Fast Image Synthesis [54.72356560597428]
We introduce DMD2, a set of techniques that lift this limitation and improve DMD training.
First, we eliminate the regression loss and the need for expensive dataset construction.
Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images.
arXiv Detail & Related papers (2024-05-23T17:59:49Z)
- Directly Denoising Diffusion Models [6.109141407163027]
We present Directly Denoising Diffusion Model (DDDM), a simple and generic approach for generating realistic images with few-step sampling.
Our model achieves FID scores of 2.57 and 2.33 on CIFAR-10 in one-step and two-step sampling respectively, surpassing those obtained from GANs and distillation-based models.
For ImageNet 64x64, our approach stands as a competitive contender against leading models.
arXiv Detail & Related papers (2024-05-22T11:20:32Z)
- Adversarial Diffusion Distillation [18.87099764514747]
Adversarial Diffusion Distillation (ADD) is a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1-4 steps.
We use score distillation to leverage large-scale off-the-shelf image diffusion models as a teacher signal.
Our model clearly outperforms existing few-step methods in a single step and reaches the performance of state-of-the-art diffusion models (SDXL) in only four steps.
arXiv Detail & Related papers (2023-11-28T18:53:24Z)
- Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z)
- On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach is able to generate images visually comparable to those of the original model.
For diffusion models trained in latent space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps.
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.