Masked Diffusion Models Are Fast Distribution Learners
- URL: http://arxiv.org/abs/2306.11363v4
- Date: Mon, 27 Nov 2023 11:34:52 GMT
- Title: Masked Diffusion Models Are Fast Distribution Learners
- Authors: Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo
Wang, Zhenguang Liu, Kui Ren
- Abstract summary: Diffusion models are commonly trained to learn all fine-grained visual information from scratch.
We show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution.
Then the pre-trained model can be fine-tuned for various generation tasks efficiently.
- Score: 32.485235866596064
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The diffusion model has emerged as the de facto model for image
generation, yet its heavy training overhead hinders broader adoption in the
research community. We observe that diffusion models are commonly trained to
learn all fine-grained visual information from scratch. This paradigm may incur
unnecessary training costs and therefore warrants in-depth investigation. In
this work, we show that it suffices to train a strong diffusion model by first
pre-training it to learn a primer distribution that loosely characterizes the
unknown real image distribution. The pre-trained model can then be fine-tuned
for various generation tasks efficiently. In the pre-training stage, we propose
to mask a high proportion (e.g., up to 90%) of each input image to
approximately represent the primer distribution, and we introduce a masked
denoising score matching objective that trains the model to denoise the
visible areas. In the subsequent fine-tuning stage, we train the diffusion
model efficiently without masking. Using this two-stage training framework, we
achieve significant training acceleration and a new record FID score of 6.27
on CelebA-HQ $256 \times 256$ for ViT-based diffusion models. The
generalizability of a pre-trained model further helps build models that
outperform ones trained from scratch on different downstream datasets. For
instance, a diffusion model pre-trained on VGGFace2 attains a 46% quality
improvement when fine-tuned on a different dataset containing only 3000
images. Our code is available at https://github.com/jiachenlei/maskdm.
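The pre-training objective described above admits a compact sketch. Below is a minimal, illustrative PyTorch training step for masked denoising score matching; it is not the authors' released code, and `patchify`, `alpha_bar`, the random patch masking, and the `model(noisy_visible, t, keep_idx)` signature are assumptions made for the example. Only the small visible subset of patches is noised and scored, which is what reduces the per-step cost.

```python
# Minimal sketch (not the authors' released code) of one masked pre-training
# step: only a small visible subset of patches is noised and scored.
import torch

def patchify(x, patch=16):
    # (B, C, H, W) -> (B, num_patches, patch*patch*C)
    B, C, H, W = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)   # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

def alpha_bar(t, num_timesteps=1000):
    # Simple cosine-schedule stand-in; any standard noise schedule works here.
    return torch.cos((t.float() / num_timesteps) * torch.pi / 2) ** 2

def masked_dsm_step(model, x0, mask_ratio=0.9, num_timesteps=1000):
    """One training step. `model(noisy_visible, t, keep_idx)` is an assumed
    ViT-style denoiser that predicts the noise on the visible patches."""
    B = x0.shape[0]
    patches = patchify(x0)                                   # (B, P, D)
    P, D = patches.shape[1], patches.shape[2]

    # Keep a small visible subset (10% of patches when mask_ratio = 0.9).
    num_keep = max(1, int(P * (1.0 - mask_ratio)))
    keep_idx = torch.rand(B, P, device=x0.device).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(patches, 1, keep_idx[..., None].expand(-1, -1, D))

    # Standard forward noising, applied to the visible patches only.
    t = torch.randint(0, num_timesteps, (B,), device=x0.device)
    a = alpha_bar(t, num_timesteps)[:, None, None]
    noise = torch.randn_like(visible)
    noisy_visible = a.sqrt() * visible + (1.0 - a).sqrt() * noise

    # Masked patches never enter the loss or the forward pass, which is what
    # makes each pre-training step cheap.
    pred_noise = model(noisy_visible, t, keep_idx)
    return ((pred_noise - noise) ** 2).mean()
```

In the fine-tuning stage, the same step would simply be run with mask_ratio=0.0, i.e., on full, unmasked images, as the abstract describes.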
Related papers
- Learning on Less: Constraining Pre-trained Model Learning for Generalizable Diffusion-Generated Image Detection [13.610095493539394]
Diffusion Models enable realistic image generation, raising the risk of misinformation and eroding public trust.
Currently, detecting images generated by unseen diffusion models remains challenging due to the limited generalization capabilities of existing methods.
We propose a simple yet effective training method called Learning on Less (LoL).
arXiv Detail & Related papers (2024-12-01T04:01:43Z)
- Diffusion Models Need Visual Priors for Image Generation [86.92260591389818]
Diffusion on Diffusion (DoD) is an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model.
We evaluate DoD on the popular ImageNet $256 \times 256$ dataset, reducing training cost by $7\times$ compared to SiT and DiT.
Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.
arXiv Detail & Related papers (2024-10-11T05:03:56Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
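As a rough illustration of that idea, and not the paper's exact objective, a contrastive-style penalty can be applied to features of synthetic images so that samples from different classes are pushed apart; the function name, temperature, and feature extractor below are all assumptions for the sketch.

```python
# Illustrative sketch only (not the paper's loss): penalize high similarity
# between synthetic samples that belong to different classes, discouraging
# overlap between their feature distributions.
import torch
import torch.nn.functional as F

def class_overlap_penalty(features, labels, temperature=0.1):
    """features: (N, D) features of synthetic images; labels: (N,) class ids."""
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                    # (N, N) scaled cosine similarities
    diff_class = labels[:, None] != labels[None, :]  # True for cross-class pairs
    # Average exponentiated similarity over cross-class pairs only.
    return (sim.exp() * diff_class).sum() / diff_class.sum().clamp(min=1)
```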
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Conditional Image Generation with Pretrained Generative Model [1.4685355149711303]
Diffusion models have gained popularity for their ability to generate higher-quality images than GAN models.
These models require a huge amount of data, computational resources, and meticulous tuning for successful training.
We propose methods to leverage pre-trained unconditional diffusion models with additional guidance for conditional image generation.
arXiv Detail & Related papers (2023-12-20T18:27:53Z)
- Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models [52.1809084559048]
We propose a novel two-stage divide-and-conquer training strategy termed TDC Training.
It groups timesteps based on task similarity and difficulty, assigning highly customized denoising models to each group, thereby enhancing the performance of diffusion models.
While two-stage training avoids the need to train each model separately, the total training cost is even lower than training a single unified denoising model.
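A minimal sketch of the timestep-grouping idea is given below; the group boundaries, routing helper, and per-group model signature are assumed rather than taken from the paper.

```python
# Sketch (assumed, not the paper's implementation): partition the diffusion
# timesteps into groups and dispatch each sample to the denoiser that owns
# its timestep range.
import torch

def route_timesteps(t, boundaries):
    """t: (B,) sampled timesteps; boundaries: sorted upper bounds per group,
    e.g. [250, 500, 1000] for three groups."""
    return torch.bucketize(t, torch.as_tensor(boundaries, device=t.device))

def grouped_denoise(models, x_t, t, boundaries):
    """Run each sample through the model assigned to its timestep group."""
    group = route_timesteps(t, boundaries)
    out = torch.empty_like(x_t)
    for g, model in enumerate(models):
        idx = (group == g).nonzero(as_tuple=True)[0]
        if idx.numel() > 0:
            out[idx] = model(x_t[idx], t[idx])
    return out
```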
arXiv Detail & Related papers (2023-12-20T03:32:58Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach is able to generate images visually comparable to those of the original model.
For diffusion models trained in latent space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps.
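The teacher signal being distilled can be written down with the standard classifier-free guidance combination; the function names, student signature, and guidance-weight handling below are assumptions for this sketch rather than the paper's code.

```python
# Sketch of distilling a classifier-free guided teacher into a single student:
# the student is trained to match the guided prediction
#   eps = (1 + w) * eps_cond - w * eps_uncond
# in one forward pass instead of two.
import torch

@torch.no_grad()
def guided_teacher_eps(teacher, x_t, t, cond, null_cond, w):
    eps_cond = teacher(x_t, t, cond)         # conditional prediction
    eps_uncond = teacher(x_t, t, null_cond)  # unconditional prediction
    return (1.0 + w) * eps_cond - w * eps_uncond

def distill_loss(student, teacher, x_t, t, cond, null_cond, w=4.0):
    # Passing w to the student lets one distilled model cover a range of
    # guidance strengths (an assumption about the interface, not a given).
    target = guided_teacher_eps(teacher, x_t, t, cond, null_cond, w)
    pred = student(x_t, t, cond, w)
    return ((pred - target) ** 2).mean()
```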
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt the generative model through retrieval enables several new capabilities.
Fine-tuning trained models to new samples can be achieved by simply adding them to the retrieval table.
Our diffusion-based model trains on images only, by leveraging a joint text-image multi-modal metric.
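A minimal sketch of such a retrieval table follows, assuming CLIP-like embeddings that place images and text in a shared space; the class and method names are illustrative, not the paper's API.

```python
# Sketch (assumed): k-nearest-neighbour lookup in a joint text-image embedding
# space. Adapting to new samples amounts to appending their embeddings; the
# retrieved neighbours would then condition the diffusion model.
import torch
import torch.nn.functional as F

class RetrievalTable:
    def __init__(self, image_embeddings):
        # (N, D) embeddings of training images, L2-normalised once up front.
        self.table = F.normalize(image_embeddings, dim=1)

    def add(self, new_embeddings):
        # "Fine-tuning to new samples" reduces to extending the table.
        self.table = torch.cat([self.table, F.normalize(new_embeddings, dim=1)])

    def knn(self, query, k=10):
        # query: (D,) embedding of a text prompt or image in the same space.
        sims = self.table @ F.normalize(query, dim=0)
        return self.table[sims.topk(min(k, len(self.table))).indices]
```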
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.