Fast Training of Diffusion Models with Masked Transformers
- URL: http://arxiv.org/abs/2306.09305v2
- Date: Tue, 5 Mar 2024 01:10:18 GMT
- Title: Fast Training of Diffusion Models with Masked Transformers
- Authors: Hongkai Zheng, Weili Nie, Arash Vahdat, Anima Anandkumar
- Abstract summary: We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves competitive and even better generative performance than the state-of-the-art Diffusion Transformer (DiT) model.
- Score: 107.77340216247516
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We propose an efficient approach to train large diffusion models with masked
transformers. While masked transformers have been extensively explored for
representation learning, their application to generative learning is less
explored in the vision domain. Our work is the first to exploit masked training
to reduce the training cost of diffusion models significantly. Specifically, we
randomly mask out a high proportion (e.g., 50%) of patches in diffused input
images during training. For masked training, we introduce an asymmetric
encoder-decoder architecture consisting of a transformer encoder that operates
only on unmasked patches and a lightweight transformer decoder on full patches.
To promote a long-range understanding of full patches, we add an auxiliary task
of reconstructing masked patches to the denoising score matching objective that
learns the score of unmasked patches. Experiments on ImageNet-256x256 and
ImageNet-512x512 show that our approach achieves competitive and even better
generative performance than the state-of-the-art Diffusion Transformer (DiT)
model, using only around 30% of its original training time. Thus, our method
shows a promising way of efficiently training large transformer-based diffusion
models without sacrificing the generative performance.
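To make the recipe above concrete, here is a minimal PyTorch sketch of one masked training step. It assumes a 50% mask ratio, an epsilon-prediction target, a toy noise schedule, and small layer sizes; the auxiliary-loss target and the weight `lambda_mae` are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch (PyTorch) of the masked training step described in the abstract.
# Assumptions: 50% mask ratio, epsilon prediction, a stand-in noise schedule, and
# toy layer sizes; the auxiliary reconstruction target and its weight are illustrative.
import torch
import torch.nn as nn

PATCH, DIM, ENC_DEPTH, DEC_DEPTH = 16, 384, 6, 2   # assumed toy sizes


def patchify(x, p=PATCH):
    """(B, C, H, W) -> (B, N, C*p*p) non-overlapping patches."""
    B, C, H, W = x.shape
    x = x.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


class MaskedDiffusionTransformer(nn.Module):
    """Asymmetric design: the encoder sees only unmasked patches, while a
    lightweight decoder operates on the full token sequence."""

    def __init__(self, patch_dim, num_patches):
        super().__init__()
        self.embed = nn.Linear(patch_dim, DIM)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, DIM))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        block = nn.TransformerEncoderLayer(DIM, 6, 4 * DIM, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=ENC_DEPTH)
        self.decoder = nn.TransformerEncoder(block, num_layers=DEC_DEPTH)
        self.head = nn.Linear(DIM, patch_dim)        # per-patch prediction

    def forward(self, noisy_patches, keep_idx):
        B, N, _ = noisy_patches.shape
        expand = lambda idx: idx[..., None].expand(-1, -1, DIM)
        tokens = self.embed(noisy_patches) + self.pos
        visible = self.encoder(torch.gather(tokens, 1, expand(keep_idx)))
        # Scatter encoded visible tokens among learnable mask tokens for the decoder.
        full = self.mask_token.expand(B, N, DIM).clone()
        full.scatter_(1, expand(keep_idx), visible)
        return self.head(self.decoder(full + self.pos))


def training_step(model, x0, mask_ratio=0.5, lambda_mae=0.1):
    """Noise the image, mask half of the patches, and combine a denoising loss
    on unmasked patches with an auxiliary reconstruction loss on masked ones."""
    B = x0.shape[0]
    noise = torch.randn_like(x0)
    a = torch.rand(B, 1, 1, 1)                       # stand-in for a noise schedule
    noisy_patches = patchify(a.sqrt() * x0 + (1 - a).sqrt() * noise)
    eps_target = patchify(noise)

    N = noisy_patches.shape[1]
    n_keep = int(N * (1 - mask_ratio))
    perm = torch.rand(B, N).argsort(dim=1)
    keep_idx, mask_idx = perm[:, :n_keep], perm[:, n_keep:]

    pred = model(noisy_patches, keep_idx)
    take = lambda t, idx: torch.gather(t, 1, idx[..., None].expand(-1, -1, t.shape[-1]))
    loss_dsm = (take(pred, keep_idx) - take(eps_target, keep_idx)).pow(2).mean()
    # Auxiliary MAE-style term: reconstruct masked patches of the diffused input
    # (the exact reconstruction target used in the paper may differ).
    loss_mae = (take(pred, mask_idx) - take(noisy_patches, mask_idx)).pow(2).mean()
    return loss_dsm + lambda_mae * loss_mae
```

A call such as `training_step(MaskedDiffusionTransformer(3 * 16 * 16, (256 // 16) ** 2), torch.randn(4, 3, 256, 256))` runs one step; class conditioning, timestep embeddings, and a proper noise schedule are omitted for brevity.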
Related papers
- Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget [53.311109531586844]
We demonstrate very low-cost training of large-scale T2I diffusion transformer models.
We train a 1.16 billion parameter sparse transformer for an economical cost of only $1,890 and achieve a 12.7 FID in zero-shot generation.
We aim to release our end-to-end training pipeline to further democratize the training of large-scale diffusion models on micro-budgets.
arXiv Detail & Related papers (2024-07-22T17:23:28Z)
- Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD)
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
arXiv Detail & Related papers (2024-06-25T16:24:34Z)
- Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models [166.64847903649598]
We propose Patch Diffusion, a generic patch-wise training framework.
Patch Diffusion significantly reduces the training time costs while improving data efficiency.
We achieve outstanding FID scores in line with state-of-the-art benchmarks.
arXiv Detail & Related papers (2023-04-25T02:35:54Z)
- MP-Former: Mask-Piloted Transformer for Image Segmentation [16.620469868310288]
Mask2Former suffers from inconsistent mask predictions between decoder layers.
We propose a mask-piloted training approach, which feeds noised ground-truth masks into masked attention and trains the model to reconstruct the original ones.
arXiv Detail & Related papers (2023-03-13T17:57:59Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed as MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods.
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers [35.26148770111607]
Mixed and Masked AutoEncoder (MixMAE) is a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers.
This paper explores using a Swin Transformer with a large window size and scales the model up to a huge size (600M parameters). Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs.
arXiv Detail & Related papers (2022-05-26T04:00:42Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this recipe follows this entry).
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
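The MAE recipe summarized in the last entry, masking random patches, encoding only the visible ones, and reconstructing the missing pixels, can be sketched in a few lines. The tiny model sizes, the 75% mask ratio, and the single-layer decoder below are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of MAE-style pretraining: encode visible patches only and
# reconstruct pixels at the masked positions. Sizes and mask ratio are assumptions.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, patch_dim=3 * 16 * 16, num_patches=196, dim=256):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, 4, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(layer, num_layers=1)   # shallow decoder
        self.to_pixels = nn.Linear(dim, patch_dim)

    def forward(self, patches, mask_ratio=0.75):
        B, N, D = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        keep, masked = perm[:, :n_keep], perm[:, n_keep:]

        idx = lambda i, d: i[..., None].expand(-1, -1, d)
        tokens = self.embed(patches) + self.pos
        visible = self.encoder(torch.gather(tokens, 1, idx(keep, tokens.shape[-1])))

        # Re-insert encoded visible tokens among learnable mask tokens, then decode.
        full = self.mask_token.expand(B, N, tokens.shape[-1]).clone()
        full.scatter_(1, idx(keep, tokens.shape[-1]), visible)
        recon = self.to_pixels(self.decoder(full + self.pos))

        # Reconstruction loss only on the masked patches, as in MAE.
        return (torch.gather(recon, 1, idx(masked, D)) -
                torch.gather(patches, 1, idx(masked, D))).pow(2).mean()
```

For example, `TinyMAE()(patchify(torch.randn(2, 3, 224, 224)))`, reusing the `patchify` helper from the earlier sketch, returns the pretraining loss for one batch.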