Related papers: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher

URL: http://arxiv.org/abs/2405.14822v1
Date: Thu, 23 May 2024 17:39:09 GMT
Title: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Authors: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon
Abstract summary: PaGoDA is a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. We demonstrate PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
Score: 55.22994720855957
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.

Related papers

Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training [14.058527210122831]
We propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details.
arXiv Detail & Related papers (2025-07-27T14:11:29Z)
SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation [12.842428916585217]
The Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5. However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX.
arXiv Detail & Related papers (2025-05-31T11:59:02Z)
ProReflow: Progressive Reflow with Decomposed Velocity [52.249464542399636]
Flow matching aims to reflow the diffusion process of diffusion models into a straight line for a few-step and even one-step generation. We introduce progressive reflow, which progressively reflows the diffusion models in local timesteps until the whole diffusion progresses. We also introduce aligned v-prediction, which highlights the importance of direction matching in flow matching over magnitude matching.
arXiv Detail & Related papers (2025-03-05T04:50:53Z)
Diffusion Models Need Visual Priors for Image Generation [86.92260591389818]
Diffusion on Diffusion (DoD) is an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model. We evaluate DoD on the popular ImageNet-$256 \times 256$ dataset, reducing 7$\times$ training cost compared to SiT and DiT. Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.
arXiv Detail & Related papers (2024-10-11T05:03:56Z)
Unleashing the Power of One-Step Diffusion based Image Super-Resolution via a Large-Scale Diffusion Discriminator [81.81748032199813]
Diffusion models have demonstrated excellent performance for real-world image super-resolution (Real-ISR). We propose a new One-Step Diffusion model with a larger-scale Discriminator for SR. Our discriminator is able to distill noisy features from any time step of diffusion models in the latent space.
arXiv Detail & Related papers (2024-10-05T16:41:36Z)
Accelerating Parallel Sampling of Diffusion Models [25.347710690711562]
We propose a novel approach that accelerates the sampling of diffusion models by parallelizing the autoregressive process. Applying these techniques, we introduce ParaTAA, a universal and training-free parallel sampling algorithm. Our experiments demonstrate that ParaTAA can decrease the inference steps required by common sequential sampling algorithms by a factor of 4$\sim$14 times.
arXiv Detail & Related papers (2024-02-15T14:27:58Z)
HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation [47.43155993432259]
High-frequency-Promoting Adaptation (HiPA) is a parameter-efficient approach to enable one-step text-to-image diffusion. HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models. Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation.
arXiv Detail & Related papers (2023-11-30T00:14:07Z)
ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on CIFAR10 and ImageNet 64$\times$64 and LSUN Cat 256$\times$256 datasets.
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models [166.64847903649598]
We propose Patch Diffusion, a generic patch-wise training framework. Patch Diffusion significantly reduces the training time costs while improving data efficiency. We achieve outstanding FID scores in line with state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-04-25T02:35:54Z)
Variational Diffusion Auto-encoder: Latent Space Extraction from Pre-trained Diffusion Models [0.0]
Variational Auto-Encoders (VAEs) face challenges with the quality of generated images, often presenting noticeable blurriness. This issue stems from the unrealistic assumption that approximates the conditional data distribution, $p(\textbf{x} | \textbf{z})$, as an isotropic Gaussian. We illustrate how one can extract a latent space from a pre-existing diffusion model by optimizing an encoder to maximize the marginal data log-likelihood.
arXiv Detail & Related papers (2023-04-24T14:44:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.