Masked Diffusion Models Are Fast Distribution Learners
- URL: http://arxiv.org/abs/2306.11363v4
- Date: Mon, 27 Nov 2023 11:34:52 GMT
- Title: Masked Diffusion Models Are Fast Distribution Learners
- Authors: Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo
Wang, Zhenguang Liu, Kui Ren
- Abstract summary: Diffusion models are commonly trained to learn all fine-grained visual information from scratch.
We show that it suffices to train a strong diffusion model by first pre-training the model to learn some primer distribution.
Then the pre-trained model can be fine-tuned for various generation tasks efficiently.
- Score: 32.485235866596064
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The diffusion model has emerged as the \emph{de-facto} model for image
generation, yet its heavy training overhead hinders broader adoption in the
research community. We observe that diffusion models are commonly trained to
learn all fine-grained visual information from scratch. This paradigm may incur
unnecessary training costs and hence warrants in-depth investigation. In this
work, we show that it suffices to train a strong diffusion model by first
pre-training the model to learn some primer distribution that loosely
characterizes the unknown real image distribution. Then the pre-trained model
can be fine-tuned for various generation tasks efficiently. In the pre-training
stage, we propose to mask a high proportion (e.g., up to 90\%) of input images
to approximately represent the primer distribution and introduce a masked
denoising score matching objective to train a model to denoise visible areas.
In the subsequent fine-tuning stage, we efficiently train the diffusion model
without masking. Utilizing the two-stage training framework, we achieve significant
training acceleration and a new FID record of 6.27 on CelebA-HQ $256
\times 256$ for ViT-based diffusion models. The generalizability of a
pre-trained model further helps build models that perform better than ones
trained from scratch on different downstream datasets. For instance, a
diffusion model pre-trained on VGGFace2 attains a 46\% quality improvement when
fine-tuned on a different dataset that contains only 3000 images. Our code is
available at \url{https://github.com/jiachenlei/maskdm}.
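To make the masked pre-training objective concrete, here is a minimal PyTorch-style sketch of what a masked denoising score matching loss might look like. The model interface, noise schedule, and patch handling below are assumptions for illustration only; the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn.functional as F

def masked_dsm_loss(model, x0, mask_ratio=0.9, patch_size=16):
    """Masked denoising score matching (illustrative sketch): noise the image,
    keep only a small fraction of the patches visible, and train the model to
    predict the noise on those visible patches."""
    b, c, h, w = x0.shape
    # Standard diffusion-style corruption with a simple cosine schedule (assumption).
    t = torch.rand(b, device=x0.device)
    alpha = torch.cos(t * torch.pi / 2).view(b, 1, 1, 1)
    sigma = torch.sin(t * torch.pi / 2).view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = alpha * x0 + sigma * noise

    # Randomly choose which patches stay visible (e.g. 10% when mask_ratio=0.9).
    n_patches = (h // patch_size) * (w // patch_size)
    n_keep = max(1, int(n_patches * (1.0 - mask_ratio)))
    keep_idx = torch.rand(b, n_patches, device=x0.device).argsort(dim=1)[:, :n_keep]

    # Hypothetical ViT-style denoiser: consumes the noisy image, the timestep,
    # and the indices of the visible patches; returns per-patch noise predictions.
    pred = model(xt, t, keep_idx)  # (b, n_keep, c * patch_size**2)

    # Gather the true noise on the same visible patches to form the target.
    target = noise.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    target = target.permute(0, 2, 3, 1, 4, 5).reshape(b, n_patches, -1)
    target = torch.gather(target, 1, keep_idx[..., None].expand(-1, -1, target.shape[-1]))

    # Score matching restricted to the visible areas.
    return F.mse_loss(pred, target)
```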
Related papers
- Diffusion Models Need Visual Priors for Image Generation [86.92260591389818]
Diffusion on Diffusion (DoD) is an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then uses them to provide rich guidance for the diffusion model.
We evaluate DoD on the popular ImageNet $256 \times 256$ benchmark, reducing training cost by $7\times$ compared to SiT and DiT.
Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.
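As a rough illustration of the multi-stage idea, the sketch below generates once, encodes the sample into a visual prior, and conditions the next pass on it. Both callables are assumed interfaces, not the paper's actual API.

```python
def diffusion_on_diffusion(sampler, prior_encoder, cond, stages=2):
    """Sketch of a DoD-style loop: each stage conditions the diffusion sampler
    on a visual prior extracted from the previous stage's sample.
    `sampler` and `prior_encoder` are hypothetical callables."""
    sample, visual_prior = None, None
    for _ in range(stages):
        sample = sampler(cond=cond, visual_prior=visual_prior)  # prior is None in stage 1
        visual_prior = prior_encoder(sample)                    # e.g. compact latent features
    return sample
```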
arXiv Detail & Related papers (2024-10-11T05:03:56Z)
- FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z)
- Training Class-Imbalanced Diffusion Model Via Overlap Optimization [55.96820607533968]
Diffusion models trained on real-world datasets often yield inferior fidelity for tail classes.
Deep generative models, including diffusion models, are biased towards classes with abundant training images.
We propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
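The overlap-minimization idea can be approximated with a standard supervised-contrastive term over features of synthetic images from different classes. The feature space, temperature, and weighting below are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def class_overlap_contrastive_loss(features, labels, temperature=0.1):
    """Supervised-contrastive-style sketch: features of synthetic images from
    the same class are pulled together, different classes pushed apart."""
    z = F.normalize(features, dim=1)                      # (n, d)
    sim = z @ z.t() / temperature                         # pairwise similarities
    n = z.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Log-softmax over all other samples, averaged over positive pairs.
    logits = sim.masked_fill(self_mask, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss.mean()
```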
arXiv Detail & Related papers (2024-02-16T16:47:21Z)
- Large-scale Reinforcement Learning for Diffusion Models [30.164571425479824]
Text-to-image diffusion models are susceptible to implicit biases that arise from web-scale text-image training pairs.
We present an effective, scalable algorithm for improving diffusion models using Reinforcement Learning (RL).
We show how our approach substantially outperforms existing methods for aligning diffusion models with human preferences.
arXiv Detail & Related papers (2024-01-20T08:10:43Z)
- Conditional Image Generation with Pretrained Generative Model [1.4685355149711303]
Diffusion models have gained popularity for their ability to generate higher-quality images than GAN models.
These models require a huge amount of data, computational resources, and meticulous tuning for successful training.
We propose methods to leverage pre-trained unconditional diffusion models with additional guidance for the purpose of conditional image generation.
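One common way to steer a pretrained unconditional model is classifier guidance, sketched below. The summary does not state that the paper uses exactly this form, so treat the interfaces and scaling as assumptions.

```python
import torch

def classifier_guided_eps(unet, classifier, x_t, t, target_class, scale=2.0):
    """Steer a pretrained *unconditional* diffusion model with an external
    classifier (classifier-guidance sketch; `unet` and `classifier` are
    hypothetical interfaces)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(x_in.shape[0]), target_class].sum()
        grad = torch.autograd.grad(selected, x_in)[0]  # grad of log p(y | x_t) w.r.t. x_t
    eps = unet(x_t, t)                                 # unconditional noise estimate
    # In practice the gradient term is also scaled by the current noise level sigma_t.
    return eps - scale * grad
```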
arXiv Detail & Related papers (2023-12-20T18:27:53Z)
- LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models [62.75006608940132]
This work proposes to enhance prompt understanding capabilities in text-to-image diffusion models.
Our method leverages a pretrained large language model for grounded generation in a novel two-stage process.
Our method significantly outperforms the base diffusion model and several strong baselines in accurately generating images.
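The two-stage process is commonly described as: (1) an LLM turns the caption into an explicit object layout, and (2) a layout-conditioned diffusion model renders it. A minimal sketch, with both callables assumed:

```python
import json

def llm_grounded_generation(llm, layout_to_image, prompt):
    """Two-stage sketch: ask an LLM for an explicit object layout, then feed
    that layout to a layout-conditioned diffusion pipeline.
    `llm` and `layout_to_image` are hypothetical callables."""
    # Stage 1: prompt the LLM for a JSON list of (object, bounding box) pairs.
    layout_prompt = (
        "Convert this caption into a JSON list of objects with "
        "[x, y, width, height] boxes on a 512x512 canvas:\n" + prompt
    )
    layout = json.loads(llm(layout_prompt))

    # Stage 2: render the boxes, using the original caption for global context.
    return layout_to_image(caption=prompt, layout=layout)
```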
arXiv Detail & Related papers (2023-05-23T03:59:06Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce general-purpose visual features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- On Distillation of Guided Diffusion Models [94.95228078141626]
We propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from.
For standard diffusion models trained in pixel space, our approach can generate images visually comparable to those of the original model.
For diffusion models trained in latent space (e.g., Stable Diffusion), our approach can generate high-fidelity images using as few as 1 to 4 denoising steps.
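In this line of work, the distillation target is the teacher's classifier-free-guided prediction, which the student learns to reproduce in a single forward pass while taking the guidance weight as an input. A minimal sketch with assumed model interfaces:

```python
import torch
import torch.nn.functional as F

def cfg_distillation_loss(student, teacher, x_t, t, cond, w):
    """Sketch of distilling classifier-free guidance: the w-conditioned student
    matches the teacher's guided prediction (interfaces are assumptions)."""
    with torch.no_grad():
        eps_cond = teacher(x_t, t, cond)      # conditional branch
        eps_uncond = teacher(x_t, t, None)    # unconditional branch
        # Classifier-free guidance combination the student must learn to mimic.
        target = (1.0 + w) * eps_cond - w * eps_uncond
    pred = student(x_t, t, cond, w)           # student takes the guidance weight w as input
    return F.mse_loss(pred, target)
```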
arXiv Detail & Related papers (2022-10-06T18:03:56Z)
- On Data Scaling in Masked Image Modeling [36.00347416479826]
Masked image modeling (MIM) has been suspected of being unable to benefit from larger data.
We study data scales ranging from 10% of ImageNet-1K to full ImageNet-22K, model sizes from 49 million to 1 billion parameters, and training lengths from 125K to 500K iterations.
We find that the validation loss in pre-training is a good indicator of how well the model performs when fine-tuned on multiple tasks.
arXiv Detail & Related papers (2022-06-09T17:58:24Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt the generative model to new samples enables several new capabilities.
Fine-tuning a trained model on new samples can be achieved by simply adding them to the retrieval table.
Our diffusion-based model trains on images only, by leveraging a joint Text-Image multi-modal metric.
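A minimal sketch of the retrieval step under these assumptions: embeddings live in a shared text-image space, the sampler is conditioned on the retrieved neighbours, and adaptation amounts to appending rows to the table.

```python
import torch
import torch.nn.functional as F

def knn_retrieve(query_emb, table_embs, k=10):
    """Retrieve the k nearest image embeddings for a query (text or image)
    embedding in a shared multi-modal space, by cosine similarity.
    Adapting to new data amounts to concatenating rows onto `table_embs`."""
    q = F.normalize(query_emb, dim=-1)        # (d,)
    tbl = F.normalize(table_embs, dim=-1)     # (n, d)
    scores = tbl @ q                          # (n,) cosine similarities
    topk = scores.topk(min(k, tbl.shape[0])).indices
    return table_embs[topk]                   # neighbours used to condition the sampler

# Hypothetical usage: neighbours = knn_retrieve(text_encoder(prompt), image_table)
#                     sample = diffusion_sampler(cond=neighbours)
```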
arXiv Detail & Related papers (2022-04-06T14:13:35Z)