PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
- URL: http://arxiv.org/abs/2405.14822v2
- Date: Tue, 29 Oct 2024 15:26:00 GMT
- Title: PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
- Authors: Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon
- Abstract summary: PaGoDA is a novel pipeline that reduces training costs through three stages: training diffusion on downsampled data, distilling the pretrained diffusion model, and progressive super-resolution.
With the proposed pipeline, PaGoDA achieves a $64\times$ reduction in training cost by training its diffusion model on $8\times$ downsampled data.
PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models.
- Score: 55.22994720855957
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Diffusion models perform remarkably well at generating high-dimensional content but are computationally intensive, especially during training. We propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a novel pipeline that reduces training costs through three stages: training diffusion on downsampled data, distilling the pretrained diffusion model, and progressive super-resolution. With the proposed pipeline, PaGoDA achieves a $64\times$ reduction in training cost by training its diffusion model on $8\times$ downsampled data; at inference, with a single step, it achieves state-of-the-art performance on ImageNet across all resolutions from $64\times 64$ to $512\times 512$, as well as on text-to-image generation. PaGoDA's pipeline can be applied directly in the latent space, adding compression alongside the pre-trained autoencoder in Latent Diffusion Models (e.g., Stable Diffusion). The code is available at https://github.com/sony/pagoda.
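The three stages map directly onto three short training loops. Below is a minimal, hedged PyTorch sketch of that structure on toy tensors; the architectures, the forward process, and the one-step distillation target are simplified stand-ins rather than the paper's actual method (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUNet(nn.Module):
    """Stand-in for the diffusion backbone (the real model is a UNet)."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(ch, 32, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(32, ch, 3, padding=1))
    def forward(self, x, t):
        return self.net(x)  # real models also condition on the timestep t

teacher, student = TinyUNet(), TinyUNet()
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)
opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)

hi_res = torch.randn(4, 3, 64, 64)   # toy "data" batch
lo_res = F.avg_pool2d(hi_res, 8)     # stage-1 input: 8x downsampled

# Stage 1: ordinary denoising-diffusion training, but on downsampled data;
# 8x8 = 64x fewer pixels is where the quoted cost reduction comes from.
t = torch.rand(4, 1, 1, 1)
noise = torch.randn_like(lo_res)
x_t = (1 - t) * lo_res + t * noise   # simplified forward process
loss1 = F.mse_loss(teacher(x_t, t), noise)
opt_t.zero_grad(); loss1.backward(); opt_t.step()

# Stage 2: distill the frozen teacher into a one-step student at low resolution.
z = torch.randn_like(lo_res)
ones = torch.ones(4, 1, 1, 1)
with torch.no_grad():
    target = z - teacher(z, ones)    # crude one-step "teacher sample"
loss2 = F.mse_loss(student(z, ones), target)
opt_s.zero_grad(); loss2.backward(); opt_s.step()

# Stage 3: progressively grow the generator with upsampling blocks toward the
# full resolution, supervised by high-resolution data.
upsampler = nn.Sequential(nn.Upsample(scale_factor=8),
                          nn.Conv2d(3, 3, 3, padding=1))
opt_u = torch.optim.Adam(upsampler.parameters(), lr=1e-4)
with torch.no_grad():
    base = student(z, ones)
loss3 = F.mse_loss(upsampler(base), hi_res)
opt_u.zero_grad(); loss3.backward(); opt_u.step()
```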
Related papers
- Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training [14.058527210122831]
We propose a Transfer VAE Training (TVT) strategy to transfer the $8\times$ downsampled VAE into a $4\times$ one while adapting to the pre-trained UNet.
The TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing fine image details.
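As a rough illustration of the alignment idea (not the paper's procedure), the hedged sketch below trains a toy $4\times$ encoder-decoder with a reconstruction loss plus a term pulling its pooled latents toward a frozen $8\times$ VAE's latents; the pooling-based alignment and all module sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_ae(stride):  # toy encoder/decoder pair with a given downsampling factor
    enc = nn.Conv2d(3, 4, kernel_size=stride, stride=stride)
    dec = nn.ConvTranspose2d(4, 3, kernel_size=stride, stride=stride)
    return enc, dec

enc8, dec8 = conv_ae(8)            # frozen original 8x VAE (stand-in)
enc4, dec4 = conv_ae(4)            # new 4x VAE being trained
opt = torch.optim.Adam([*enc4.parameters(), *dec4.parameters()], lr=1e-4)

x = torch.randn(2, 3, 256, 256)    # toy image batch
with torch.no_grad():
    z8 = enc8(x)                   # reference latents on the 32x32 grid

z4 = enc4(x)                       # new latents on the finer 64x64 grid
# Alignment: pool the 4x latents onto the 8x grid and match them, so the new
# latent space stays close to what the pre-trained UNet was trained on.
align = F.mse_loss(F.avg_pool2d(z4, 2), z8)
recon = F.mse_loss(dec4(z4), x)    # fine details come from the finer latent grid
loss = recon + 0.1 * align
opt.zero_grad(); loss.backward(); opt.step()
```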
arXiv Detail & Related papers (2025-07-27T14:11:29Z)
- SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation [12.842428916585217]
Distribution Matching Distillation (DMD) has been successfully applied to text-to-image diffusion models such as Stable Diffusion (SD) 1.5.
However, vanilla DMD suffers from convergence difficulties on large-scale flow-based text-to-image models, such as SD 3.5 and FLUX.
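For context, here is a hedged sketch of the vanilla DMD generator update that such work builds on; all networks are toy stand-ins, and the parallel training of the fake-score model is omitted.

```python
import torch
import torch.nn as nn

G = nn.Conv2d(4, 4, 3, padding=1)          # one-step generator (stand-in)
s_real = nn.Conv2d(4, 4, 3, padding=1)     # frozen teacher score model
s_fake = nn.Conv2d(4, 4, 3, padding=1)     # score model tracking G's outputs
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

z = torch.randn(2, 4, 32, 32)
x_g = G(z)                                 # generator sample
t = torch.rand(2, 1, 1, 1)
x_t = (1 - t) * x_g + t * torch.randn_like(x_g)  # noise it to a random level

with torch.no_grad():
    grad = s_fake(x_t) - s_real(x_t)       # distribution-matching direction
# Surrogate loss whose gradient w.r.t. G's output equals the direction above.
loss = (x_t * grad).mean()
opt_g.zero_grad(); loss.backward(); opt_g.step()
# (s_fake is trained in parallel on G's samples with a denoising loss; omitted.)
```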
arXiv Detail & Related papers (2025-05-31T11:59:02Z)
- ProReflow: Progressive Reflow with Decomposed Velocity [52.249464542399636]
Flow matching aims to reflow the diffusion process of diffusion models into a straight line for few-step and even one-step generation.
We introduce progressive reflow, which reflows the diffusion model over local timestep windows until the whole trajectory is straightened.
We also introduce aligned v-prediction, which highlights the importance of direction matching in flow matching over magnitude matching.
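A hedged sketch of such a direction-weighted loss is below; the exact weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def aligned_v_loss(v_pred, v_target, lambda_dir=1.0, lambda_mag=0.1):
    """Emphasize direction agreement (cosine) over magnitude agreement."""
    v_pred = v_pred.flatten(1)
    v_target = v_target.flatten(1)
    cos = F.cosine_similarity(v_pred, v_target, dim=1)   # direction match
    dir_loss = (1.0 - cos).mean()
    mag_loss = F.mse_loss(v_pred.norm(dim=1), v_target.norm(dim=1))
    return lambda_dir * dir_loss + lambda_mag * mag_loss

v_pred = torch.randn(4, 3 * 32 * 32, requires_grad=True)  # predicted velocity
v_target = torch.randn(4, 3 * 32 * 32)                    # reflow target velocity
loss = aligned_v_loss(v_pred, v_target)
loss.backward()
```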
arXiv Detail & Related papers (2025-03-05T04:50:53Z)
- Diffusion Models Need Visual Priors for Image Generation [86.92260591389818]
Diffusion on Diffusion (DoD) is an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model.
We evaluate DoD on the popular ImageNet $256\times 256$ dataset, reducing training cost by $7\times$ compared to SiT and DiT.
Our largest model, DoD-XL, achieves an FID-50K score of 1.83 with only 1 million training steps, surpassing other state-of-the-art methods without bells and whistles during inference.
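As a loose illustration of the loop described above (details are assumptions, not the paper's design): generate once, compress the result into a prior, then condition the next pass on it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 3, 3, padding=1)  # image + prior channels
    def forward(self, x, prior):
        return self.net(torch.cat([x, prior], dim=1))

model = CondDenoiser()
x = torch.randn(1, 3, 64, 64)                 # stage-1 sample (stand-in)
# Extract a cheap visual prior: the low-frequency structure of the first sample.
prior = F.interpolate(F.avg_pool2d(x, 8), scale_factor=8, mode="nearest")
# Stage 2: the denoising steps now receive the prior as extra conditioning.
x_t = torch.randn_like(x)
for _ in range(4):                            # a few toy refinement steps
    x_t = x_t - 0.1 * model(x_t, prior)
```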
arXiv Detail & Related papers (2024-10-11T05:03:56Z)
- Unleashing the Power of One-Step Diffusion based Image Super-Resolution via a Large-Scale Diffusion Discriminator [81.81748032199813]
Diffusion models have demonstrated excellent performance for real-world image super-resolution (Real-ISR).
We propose a new one-step Diffusion model with a larger-scale Discriminator for SR.
Our discriminator is able to distill noisy features from any time step of diffusion models in the latent space.
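A hedged sketch of that discriminator design: a frozen diffusion-style backbone extracts features from latents noised to a random timestep, and only a small head is trained. Sizes and losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Conv2d(4, 64, 3, padding=1)   # stand-in for frozen diffusion features
for p in backbone.parameters():
    p.requires_grad_(False)                 # frozen; only the head trains
head = nn.Sequential(nn.SiLU(), nn.Conv2d(64, 1, 1))

def d_logits(latent):
    t = torch.rand(latent.size(0), 1, 1, 1)          # sample any timestep
    noised = (1 - t) * latent + t * torch.randn_like(latent)
    return head(backbone(noised)).mean(dim=(1, 2, 3))

real = torch.randn(2, 4, 32, 32)   # latents of real high-resolution images
fake = torch.randn(2, 4, 32, 32)   # latents from the one-step SR generator
# Non-saturating GAN losses on the timestep-noised latent features.
d_loss = F.softplus(-d_logits(real)).mean() + F.softplus(d_logits(fake)).mean()
d_loss.backward()                  # updates only the head (backbone is frozen)
```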
arXiv Detail & Related papers (2024-10-05T16:41:36Z)
- Accelerating Parallel Sampling of Diffusion Models [25.347710690711562]
We propose a novel approach that accelerates the sampling of diffusion models by parallelizing the autoregressive process.
Applying these techniques, we introduce ParaTAA, a universal and training-free parallel sampling algorithm.
Our experiments demonstrate that ParaTAA can decrease the number of inference steps required by common sequential sampling algorithms by a factor of 4 to 14.
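The underlying idea can be sketched as trajectory-level fixed-point iteration; the toy version below uses plain sweeps, whereas ParaTAA's acceleration is what makes the outer loop short in practice.

```python
import torch

def step(x, k):                    # one sequential sampling update (toy stand-in)
    return 0.9 * x + 0.01 * torch.tanh(x) + 0.001 * k

K = 20                             # number of sequential sampling steps
x0 = torch.randn(8)
traj = [x0.clone() for _ in range(K + 1)]  # guess for the whole trajectory at once

# Sequential sampling performs K dependent calls. Here each sweep updates every
# state from the previous iterate, so the K calls inside one sweep are
# independent and can run as a single batched, parallel forward pass. Plain
# fixed-point sweeps need up to K outer iterations; ParaTAA's triangular
# Anderson acceleration is what cuts that down in practice.
for sweep in range(K):
    new = [x0] + [step(traj[k], k) for k in range(K)]
    if max(float((a - b).abs().max()) for a, b in zip(new, traj)) < 1e-8:
        break                      # trajectory stopped changing: fixed point
    traj = new
```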
arXiv Detail & Related papers (2024-02-15T14:27:58Z)
- HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation [47.43155993432259]
High-frequency-Promoting Adaptation (HiPA) is a parameter-efficient approach to enable one-step text-to-image diffusion.
HiPA focuses on training one-step, low-rank adaptors to specifically enhance the under-represented high-frequency abilities of advanced diffusion models.
Compared with progressive distillation, HiPA achieves much better performance in one-step text-to-image generation.
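The two ingredients summarized above, a low-rank adaptor and a high-frequency-promoting loss, can be sketched as follows; the FFT high-pass mask, its radius, and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base layer plus a trainable rank-r update (the adaptor part)."""
    def __init__(self, dim, r=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(r, dim))     # update starts at zero
        self.B = nn.Parameter(torch.randn(dim, r) * 0.01)
    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def high_freq(img, radius=8):
    """Zero out spectral components within `radius` of DC (an FFT high-pass)."""
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    mask = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2) > radius ** 2
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

lora = LoRALinear(64)
_ = lora(torch.randn(2, 64))       # only A and B would receive gradients

pred = torch.randn(2, 3, 32, 32, requires_grad=True)  # one-step output
target = torch.randn(2, 3, 32, 32)                    # teacher / data target
# Weight the high-frequency band extra, since that is what one-step models lose.
loss = F.mse_loss(pred, target) + \
       2.0 * F.mse_loss(high_freq(pred), high_freq(target))
loss.backward()
```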
arXiv Detail & Related papers (2023-11-30T00:14:07Z)
- ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models [59.90959789767886]
We show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions.
By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet $64\times 64$, and LSUN Cat $256\times 256$ datasets.
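A hedged sketch of combining the two losses (toy modules; the paper's exact parameterization and schedules differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Conv2d(3, 3, 3, padding=1)      # consistency model (stand-in)
f_ema = nn.Conv2d(3, 3, 3, padding=1)  # EMA copy used as the target network
D = nn.Conv2d(3, 1, 4, stride=2)       # discriminator
opt_f = torch.optim.Adam(f.parameters(), lr=1e-4)

x = torch.randn(2, 3, 32, 32)          # data batch
noise = torch.randn_like(x)
t, t_next = 0.5, 0.6                   # adjacent noise levels
x_t, x_next = x + t * noise, x + t_next * noise

# Consistency loss: predictions from adjacent noise levels should coincide.
with torch.no_grad():
    target = f_ema(x_t)
consistency = F.mse_loss(f(x_next), target)

# Adversarial term: a discriminator pushes one-step outputs toward the data.
adv = F.softplus(-D(f(x_next))).mean()
loss = consistency + 0.1 * adv
opt_f.zero_grad(); loss.backward(); opt_f.step()
# (D is trained alternately to separate data x from f's outputs; omitted.)
```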
arXiv Detail & Related papers (2023-11-23T16:49:06Z)
- SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds [88.06788636008051]
Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers.
These models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run.
We present a generic approach that unlocks running text-to-image diffusion models on mobile devices in less than $2$ seconds.
arXiv Detail & Related papers (2023-06-01T17:59:25Z)
- Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models [166.64847903649598]
We propose Patch Diffusion, a generic patch-wise training framework.
Patch Diffusion significantly reduces training time while improving data efficiency.
We achieve outstanding FID scores in line with state-of-the-art benchmarks.
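A hedged sketch of patch-wise training with coordinate conditioning, which is one natural reading of the idea; the patch sizes and the conditioning scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(5, 3, 3, padding=1)   # 3 image + 2 coordinate channels
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

img = torch.randn(1, 3, 64, 64)
p = 16                                  # patch size
y0 = torch.randint(0, 64 - p + 1, (1,)).item()
x0 = torch.randint(0, 64 - p + 1, (1,)).item()
patch = img[:, :, y0:y0 + p, x0:x0 + p]

# Normalized coordinate channels tell the model where the patch sits.
ys = torch.linspace(y0, y0 + p - 1, p) / 63
xs = torch.linspace(x0, x0 + p - 1, p) / 63
yy, xx = torch.meshgrid(ys, xs, indexing="ij")
coords = torch.stack([yy, xx]).unsqueeze(0)          # (1, 2, p, p)

# Usual denoising loss, but computed on the patch alone.
t = torch.rand(1, 1, 1, 1)
noise = torch.randn_like(patch)
noisy = (1 - t) * patch + t * noise
loss = F.mse_loss(model(torch.cat([noisy, coords], dim=1)), noise)
opt.zero_grad(); loss.backward(); opt.step()         # cost scales with p^2, not 64^2
```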
arXiv Detail & Related papers (2023-04-25T02:35:54Z)
- Variational Diffusion Auto-encoder: Latent Space Extraction from Pre-trained Diffusion Models [0.0]
Variational Auto-Encoders (VAEs) face challenges with the quality of generated images, often presenting noticeable blurriness.
This issue stems from the unrealistic assumption of approximating the conditional data distribution, $p(\mathbf{x} \mid \mathbf{z})$, as an isotropic Gaussian.
We illustrate how one can extract a latent space from a pre-existing diffusion model by optimizing an encoder to maximize the marginal data log-likelihood.
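A hedged sketch of that extraction setup: the diffusion model stays frozen, and only an encoder is optimized so that the z-conditioned denoising loss (which bounds the marginal log-likelihood) improves. The conditioning mechanism here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 16))
denoiser = nn.Conv2d(4, 3, 3, padding=1)   # frozen pre-trained model (stand-in)
for p in denoiser.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

x = torch.randn(2, 3, 32, 32)
z = encoder(x)                             # latent code for each image
# Broadcast z to a conditioning channel (a simple stand-in for real conditioning).
z_map = z.mean(dim=1, keepdim=True)[:, :, None, None].expand(-1, 1, 32, 32)

t = torch.rand(2, 1, 1, 1)
noise = torch.randn_like(x)
x_t = (1 - t) * x + t * noise
# Lower denoising error given z means a tighter likelihood bound; gradients
# flow only into the encoder, carving a latent space out of the frozen model.
loss = F.mse_loss(denoiser(torch.cat([x_t, z_map], dim=1)), noise)
opt.zero_grad(); loss.backward(); opt.step()
```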
arXiv Detail & Related papers (2023-04-24T14:44:47Z)