Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
- URL: http://arxiv.org/abs/2412.17153v2
- Date: Tue, 24 Dec 2024 04:21:15 GMT
- Title: Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
- Authors: Enshu Liu, Xuefei Ning, Yu Wang, Zinan Lin,
- Abstract summary: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process.<n>We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps?<n>We propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model.
- Score: 12.985270202599814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive (AR) models have achieved state-of-the-art performance in text and image generation but suffer from slow generation due to the token-by-token process. We ask an ambitious question: can a pre-trained AR model be adapted to generate outputs in just one or two steps? If successful, this would significantly advance the development and deployment of AR models. We notice that existing works that try to speed up AR generation by generating multiple tokens at once fundamentally cannot capture the output distribution due to the conditional dependencies between tokens, limiting their effectiveness for few-step generation. To address this, we propose Distilled Decoding (DD), which uses flow matching to create a deterministic mapping from Gaussian distribution to the output distribution of the pre-trained AR model. We then train a network to distill this mapping, enabling few-step generation. DD doesn't need the training data of the original AR model, making it more practical. We evaluate DD on state-of-the-art image AR models and present promising results on ImageNet-256. For VAR, which requires 10-step generation, DD enables one-step generation (6.3$\times$ speed-up), with an acceptable increase in FID from 4.19 to 9.96. For LlamaGen, DD reduces generation from 256 steps to 1, achieving an 217.8$\times$ speed-up with a comparable FID increase from 4.11 to 11.35. In both cases, baseline methods completely fail with FID>100. DD also excels on text-to-image generation, reducing the generation from 256 steps to 2 for LlamaGen with minimal FID increase from 25.70 to 28.95. As the first work to demonstrate the possibility of one-step generation for image AR models, DD challenges the prevailing notion that AR models are inherently slow, and opens up new opportunities for efficient AR generation. The project website is at https://imagination-research.github.io/distilled-decoding.
Related papers
- SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL [112.92522479863054]
This work presents SimpleAR, a vanilla autoregressive visual generation framework without complex architecure modifications.
We demonstrate that our model can generate 1024x1024 resolution images with high fidelity, and achieve competitive results on text-to-image benchmarks.
By sharing these findings and open-sourcing the code, we hope to reveal the potential of autoregressive visual generation.
arXiv Detail & Related papers (2025-04-15T17:59:46Z) - Autoregressive Distillation of Diffusion Transformers [18.19070958829772]
We propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps.
ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information.
Our model achieves a $5times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256.
arXiv Detail & Related papers (2025-04-15T15:33:49Z) - FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning [66.5214586624095]
Existing Visual Autoregressive ( VAR) paradigms process the entire token map at each scale step, leading to the complexity and runtime scaling dramatically with image resolution.
We propose Fastmore, a post-training acceleration method for efficient resolution scaling with VARs.
Experiments show Fastmore can further speedup FlashAttention-accelerated VAR by 2.7$times$ with negligible performance drop of 1%.
arXiv Detail & Related papers (2025-03-30T08:51:19Z) - ARINAR: Bi-Level Autoregressive Feature-by-Feature Generative Models [37.65992612575692]
ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while five times faster than the latter.
ARINAR-B with 213M parameters achieves an FID of 2.75, which is comparable to the state-of-the-art MAR-B model (FID=2.31), while five times faster than the latter.
arXiv Detail & Related papers (2025-03-04T18:59:56Z) - RandAR: Decoder-only Autoregressive Visual Generation in Random Orders [54.49937384788739]
RandAR is a decoder-only visual autoregressive (AR) model capable of generating images in arbitrary token orders.<n>Our design enables random order by inserting a "position instruction token" before each image token to be predicted.<n>RandAR supports inpainting, outpainting and resolution inference in a zero-shot manner.
arXiv Detail & Related papers (2024-12-02T18:59:53Z) - Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient [52.96232442322824]
Collaborative Decoding (CoDe) is a novel efficient decoding strategy tailored for the Visual Auto-Regressive ( VAR) framework.<n>CoDe capitalizes on two critical observations: the substantially reduced parameter demands at larger scales and the exclusive generation patterns across different scales.<n>CoDe achieves a 1.7x speedup, slashes memory usage by around 50%, and preserves image quality with only a negligible FID increase from 1.95 to 1.98.
arXiv Detail & Related papers (2024-11-26T15:13:15Z) - Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step [64.53013367995325]
identity Distillation (SiD) is a data-free method that has achieved SOTA performance in image generation by leveraging only a pretrained diffusion model.
We introduce SiDA (SiD with Adversarial Loss), which not only enhances generation quality but also improves distillation efficiency by incorporating real images and adversarial loss.
arXiv Detail & Related papers (2024-10-19T00:33:51Z) - LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding.
LANTERN increases speed-ups by $mathbf1.75times$ and $mathbf1.82times$, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z) - Generative Dataset Distillation Based on Diffusion Model [45.305885410046116]
We propose a novel generative dataset distillation method based on Stable Diffusion.
Specifically, we use the SDXL-Turbo model which can generate images at high speed and quality.
We achieved third place in the generative track of the ECCV 2024 DD Challenge.
arXiv Detail & Related papers (2024-08-16T08:52:02Z) - Diffusion Models Are Innate One-Step Generators [2.3359837623080613]
Diffusion Models (DMs) can generate remarkable high-quality results.
DMs' layers are differentially activated at different time steps, leading to an inherent capability to generate images in a single step.
Our method achieves the SOTA results on CIFAR-10, AFHQv2 64x64 (FID 1.23), FFHQ 64x64 (FID 0.85) and ImageNet 64x64 (FID 1.16) with great efficiency.
arXiv Detail & Related papers (2024-05-31T11:14:12Z) - Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction [33.57820997288788]
We present a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction"
Visual AutoRegressive modeling makes GPT-like AR models surpass diffusion transformers in image generation.
We have released all models and codes to promote the exploration of AR/token models for visual generation and unified learning.
arXiv Detail & Related papers (2024-04-03T17:59:53Z) - Emage: Non-Autoregressive Text-to-Image Generation [63.347052548210236]
Non-autoregressive text-to-image models efficiently generate hundreds of image tokens in parallel.
Our model with 346M parameters generates an image of 256$times$256 with about one second on one V100 GPU.
arXiv Detail & Related papers (2023-12-22T10:01:54Z) - Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.