Unified Auto-Encoding with Masked Diffusion
- URL: http://arxiv.org/abs/2406.17688v1
- Date: Tue, 25 Jun 2024 16:24:34 GMT
- Title: Unified Auto-Encoding with Masked Diffusion
- Authors: Philippe Hansen-Estruch, Sriram Vishwanath, Amy Zhang, Manan Tomar
- Abstract summary: We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD).
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.
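The corruption scheme described in the abstract can be illustrated with a short sketch. This is a minimal reading of the abstract, not the released implementation (see the linked repository for that): with some probability a batch receives the extra noise-free, high-masking representation step; otherwise patches are masked and the image is noised at a random timestep of the usual diffusion schedule. The helper name `umd_corrupt`, the masking ratios, and the probability `p_repr` are illustrative assumptions.

```python
import torch

def umd_corrupt(x, alpha_bar, p_repr=0.1, mask_ratio=0.5, repr_mask_ratio=0.75, patch=4):
    """Sketch of UMD-style mixed corruption for a batch of images x: (B, C, H, W).

    alpha_bar: 1-D tensor of cumulative noise-schedule coefficients per timestep.
    All probabilities and ratios here are placeholders, not the paper's values.
    """
    B, C, H, W = x.shape
    is_repr = bool(torch.rand(()) < p_repr)
    ratio = repr_mask_ratio if is_repr else mask_ratio

    # Per-patch Bernoulli mask, upsampled to pixels (1 = visible, 0 = masked).
    keep = (torch.rand(B, 1, H // patch, W // patch) > ratio).float()
    keep = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)

    if is_repr:
        # Noise-free, high-masking "representation" step, treated as an extra
        # point in the noising schedule (t = 0 here is an assumption).
        return x * keep, torch.zeros(B, dtype=torch.long), torch.zeros_like(x)

    # Standard diffusion corruption applied to the masked image:
    # x_t = sqrt(a_bar_t) * x + sqrt(1 - a_bar_t) * eps.
    t = torch.randint(1, len(alpha_bar), (B,))
    a = alpha_bar[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x)
    return (a.sqrt() * x + (1 - a).sqrt() * eps) * keep, t, eps
```

A single DiT-style autoencoder can then be trained on both corruption modes with one reconstruction objective, which is the unification the abstract describes.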
Related papers
- Denoising Autoregressive Representation Learning [13.185567468951628]
Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively.
We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models.
arXiv Detail & Related papers (2024-03-08T10:19:00Z)
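As a rough illustration of the decoder-only autoregressive patch prediction described in DARL above, the sketch below regresses each flattened patch from its predecessors with a causally masked Transformer; the sizes and the MSE loss are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyPatchAR(nn.Module):
    """Decoder-only Transformer predicting patch i+1 from patches 0..i (a sketch)."""

    def __init__(self, patch_dim=48, d_model=128, n_layers=2, n_heads=4, n_patches=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)

    def forward(self, patches):                       # patches: (B, N, patch_dim)
        B, N, _ = patches.shape
        h = self.embed(patches) + self.pos[:, :N]
        causal = nn.Transformer.generate_square_subsequent_mask(N)
        h = self.blocks(h, mask=causal)               # each position sees only its past
        pred = self.head(h)
        # MSE between the prediction at position i and the true patch at i+1.
        return ((pred[:, :-1] - patches[:, 1:]) ** 2).mean()
```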
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training.
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves generative performance competitive with, and sometimes better than, the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
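The efficiency gain in the masked-transformer training above comes from diffusing the image first and then dropping most patch tokens before the expensive backbone. A hedged sketch of that input pipeline (mask ratio and token layout are assumptions):

```python
import torch
import torch.nn.functional as F

def noisy_visible_tokens(x_t, mask_ratio=0.5, patch=4):
    """Patchify a diffused image x_t: (B, C, H, W) and keep a random token subset.

    Returns visible tokens (B, n_keep, patch*patch*C) plus their indices, so a
    lightweight decoder can reinsert mask tokens at the dropped positions later.
    """
    B, C, H, W = x_t.shape
    tokens = F.unfold(x_t, patch, stride=patch).transpose(1, 2)  # (B, N, p*p*C)
    N = tokens.shape[1]
    n_keep = int(N * (1 - mask_ratio))
    ids = torch.rand(B, N).argsort(dim=1)[:, :n_keep]  # per-sample random subset
    visible = torch.gather(tokens, 1, ids.unsqueeze(-1).expand(-1, -1, tokens.shape[2]))
    return visible, ids
```

Only the visible tokens pass through the large transformer, which is where the training-cost savings come from.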
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
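Permuted prediction of the kind MaPeT uses can be mimicked with an attention mask built from a random factorization order: a token may attend only to tokens that come earlier in the sampled permutation. This is a generic permuted-objective sketch, not the authors' exact formulation:

```python
import torch

def permuted_attention_mask(n_tokens):
    """Boolean attention mask for one random factorization order.

    Entry [i, j] is True where attention from token i to token j is BLOCKED
    (PyTorch's attn_mask convention), i.e. whenever j does not precede i in
    the sampled permutation. In practice a second, content stream is needed
    so each position can still access its own embedding.
    """
    perm = torch.randperm(n_tokens)
    rank = torch.empty(n_tokens, dtype=torch.long)
    rank[perm] = torch.arange(n_tokens)  # rank[i] = position of token i in the order
    return rank.unsqueeze(1) <= rank.unsqueeze(0)  # block unless rank[j] < rank[i]
```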
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
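The conditioning scheme summarized above amounts to: keep visible patches clean, diffuse only the masked region, and denoise it given the visible context. A minimal sketch under assumed shapes and schedule:

```python
import torch

def diffmae_input(x, alpha_bar, t, mask):
    """Mix clean visible pixels with a diffused masked region (a sketch).

    x:    (B, C, H, W) clean images
    mask: (B, 1, H, W), 1 where patches are masked and must be denoised
    t:    (B,) timesteps; alpha_bar: cumulative schedule coefficients.
    """
    a = alpha_bar[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    noised = a.sqrt() * x + (1 - a).sqrt() * eps
    # The model conditions on the clean visible pixels and predicts the
    # noise (or clean pixels) inside the masked region only.
    return mask * noised + (1 - mask) * x, eps
```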
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linearly separable representations within its intermediate layers, without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively.
arXiv Detail & Related papers (2023-03-17T04:20:47Z)
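The linear-separability claim above is straightforward to test: run lightly noised images through a pretrained denoiser, pool an intermediate activation, and fit a linear classifier on it. In the sketch below, the denoiser interface, the choice of layer, and the probing timestep are all assumptions; DDAE-style probing selects the block and noise level by validation.

```python
import torch

@torch.no_grad()
def mid_layer_features(denoiser, layer, x, alpha_bar, t_probe=100):
    """Pooled activations from an intermediate `layer` of a pretrained denoiser."""
    feats = {}
    hook = layer.register_forward_hook(lambda mod, inp, out: feats.__setitem__("h", out))
    a = alpha_bar[t_probe]
    x_t = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)  # lightly noise the input
    t = torch.full((x.shape[0],), t_probe, device=x.device)
    denoiser(x_t, t)                 # forward pass fills the hook
    hook.remove()
    # Assumes a (B, D, H, W) activation; global-average-pool to (B, D).
    return feats["h"].flatten(2).mean(-1)
```

A single `nn.Linear` trained with cross-entropy on these pooled features is the linear evaluation protocol referred to above.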
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to yield effective pretrained models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
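The Gumbel-Softmax coupling mentioned above is what lets gradients flow from the reconstruction loss back into the adversarial mask generator. A minimal differentiable per-patch mask sampler (the generator producing the logits is an assumption):

```python
import torch.nn.functional as F

def differentiable_mask(logits, tau=1.0):
    """Sample a nearly binary keep/drop decision per patch.

    logits: (B, N, 2) unnormalized scores for (drop, keep) from a mask
    generator. hard=True yields one-hot samples in the forward pass with
    straight-through gradients in the backward pass, so the generator is
    trainable end to end.
    """
    sample = F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, N, 2) one-hot
    return sample[..., 1]                                  # (B, N) keep mask
```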
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
Unlike prior approaches, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we also learn an infinite-dimensional latent code that improves over state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
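One way to read the latent-code idea above: an encoder maps the clean image to a code z, the denoiser is conditioned on z, and so z is pushed to carry whatever information makes denoising easier. A hedged sketch of one training step (the encoder and denoiser interfaces are assumptions):

```python
import torch

def latent_dsm_step(encoder, denoiser, x, alpha_bar):
    """One denoising-score-matching step with a learned conditioning code (a sketch)."""
    B = x.shape[0]
    t = torch.randint(0, len(alpha_bar), (B,), device=x.device)
    a = alpha_bar[t].view(B, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps
    z = encoder(x)                       # latent code from the clean image
    # Conditioning on z lets representation-relevant information flow into
    # the encoder while the denoiser minimizes the usual noise-prediction loss.
    return ((denoiser(x_t, t, z) - eps) ** 2).mean()
```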