Related papers: Autoregressive Distillation of Diffusion Transformers

Autoregressive Distillation of Diffusion Transformers

URL: http://arxiv.org/abs/2504.11295v1
Date: Tue, 15 Apr 2025 15:33:49 GMT
Title: Autoregressive Distillation of Diffusion Transformers
Authors: Yeongmin Kim, Sotiris Anagnostidis, Yuming Du, Edgar Schönfeld, Jonas Kohler, Markos Georgopoulos, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu,
Abstract summary: We propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps.<n>ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information.<n>Our model achieves a $5times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1% extra FLOPs on ImageNet-256.
Score: 18.19070958829772
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion models with transformer architectures have demonstrated promising capabilities in generating high-fidelity images and scalability for high resolution. However, iterative sampling process required for synthesis is very resource-intensive. A line of work has focused on distilling solutions to probability flow ODEs into few-step student models. Nevertheless, existing methods have been limited by their reliance on the most recent denoised samples as input, rendering them susceptible to exposure bias. To address this limitation, we propose AutoRegressive Distillation (ARD), a novel approach that leverages the historical trajectory of the ODE to predict future steps. ARD offers two key benefits: 1) it mitigates exposure bias by utilizing a predicted historical trajectory that is less susceptible to accumulated errors, and 2) it leverages the previous history of the ODE trajectory as a more effective source of coarse-grained information. ARD modifies the teacher transformer architecture by adding token-wise time embedding to mark each input from the trajectory history and employs a block-wise causal attention mask for training. Furthermore, incorporating historical inputs only in lower transformer layers enhances performance and efficiency. We validate the effectiveness of ARD in a class-conditioned generation on ImageNet and T2I synthesis. Our model achieves a $5\times$ reduction in FID degradation compared to the baseline methods while requiring only 1.1\% extra FLOPs on ImageNet-256. Moreover, ARD reaches FID of 1.84 on ImageNet-256 in merely 4 steps and outperforms the publicly available 1024p text-to-image distilled models in prompt adherence score with a minimal drop in FID compared to the teacher. Project page: https://github.com/alsdudrla10/ARD.

Related papers

One Diffusion Step to Real-World Super-Resolution via Flow Trajectory Distillation [60.54811860967658]
FluxSR is a novel one-step diffusion Real-ISR based on flow matching models.<n>First, we introduce Flow Trajectory Distillation (FTD) to distill a multi-step flow matching model into a one-step Real-ISR.<n>Second, to improve image realism and address high-frequency artifact issues in generated images, we propose TV-LPIPS as a perceptual loss.
arXiv Detail & Related papers (2025-02-04T04:11:29Z)
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training [20.712053538676116]
Diffusion models typically suffer from sample inefficiency and high training costs.<n>We show that TREAD reduces computational cost and simultaneously boosts model performance.<n>We achieve a competitive FID of 2.09 in a guided and 3.93 in an unguided setting.
arXiv Detail & Related papers (2025-01-08T18:38:25Z)
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models [33.519892081718716]
We propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizers. Our proposed VA-VAE significantly expands the reconstruction-generation frontier of latent diffusion models. We build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT.
arXiv Detail & Related papers (2025-01-02T18:59:40Z)
When Worse is Better: Navigating the compression-generation tradeoff in visual tokenization [92.17160980120404]
We introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. CRT makes stage 1 reconstruction performance worse, but makes stage 2 generation performance better by making the tokens easier to model. We match state-of-the-art discrete autoregressive ImageNet generation (2.18 FID) with less than half the tokens per image.
arXiv Detail & Related papers (2024-12-20T20:32:02Z)
OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs [20.652907645817713]
OFTSR is a flow-based framework for one-step image super-resolution that can produce outputs with tunable levels of fidelity and realism.<n>We demonstrate that OFTSR achieves state-of-the-art performance for one-step image super-resolution, while having the ability to flexibly tune the fidelity-realism trade-off.
arXiv Detail & Related papers (2024-12-12T17:14:58Z)
Adversarial Score identity Distillation: Rapidly Surpassing the Teacher in One Step [64.53013367995325]
We introduce SiDA (SiD with Adversarial Loss), which improves generation quality and distillation efficiency.<n>SiDA incorporates real images and adversarial loss, allowing it to distinguish between real images and those generated by SiD.<n>SiDA converges significantly faster than its predecessor when distilled from scratch.
arXiv Detail & Related papers (2024-10-19T00:33:51Z)
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization [26.926712014346432]
This paper presents innovative enhancements to diffusion models by integrating a novel multi-resolution network and time-dependent layer normalization.<n>Our method's efficacy is demonstrated on the class-conditional ImageNet generation benchmark, setting new state-of-the-art FID scores of 1.70 on ImageNet 256 x 256 and 2.89 on ImageNet 512 x 512.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
Improved Distribution Matching Distillation for Fast Image Synthesis [54.72356560597428]
We introduce DMD2, a set of techniques that lift this limitation and improve DMD training. First, we eliminate the regression loss and the need for expensive dataset construction. Second, we integrate a GAN loss into the distillation procedure, discriminating between generated samples and real images.
arXiv Detail & Related papers (2024-05-23T17:59:49Z)
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation [1.5892730797514436]
Text-to-image diffusion models often suffer from slow iterative sampling processes. We present a novel image-free distillation scheme named $textbfSwiftBrush$. SwiftBrush achieves an FID score of $textbf16.67$ and a CLIP score of $textbf0.29$ on the COCO-30K benchmark.
arXiv Detail & Related papers (2023-12-08T18:44:09Z)
BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images. Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few. We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning [73.24988226158497]
We consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL) We propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation. Our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks.
arXiv Detail & Related papers (2021-06-17T17:56:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.