Diffusion Models as Masked Autoencoders
- URL: http://arxiv.org/abs/2304.03283v1
- Date: Thu, 6 Apr 2023 17:59:56 GMT
- Title: Diffusion Models as Masked Autoencoders
- Authors: Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu
Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
- Abstract summary: We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE)
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
- Score: 52.442717717898056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a longstanding belief that generation can facilitate a true
understanding of visual data. In line with this, we revisit generatively
pre-training visual representations in light of recent interest in denoising
diffusion models. While directly pre-training with diffusion models does not
produce strong representations, we condition diffusion models on masked input
and formulate diffusion models as masked autoencoders (DiffMAE). Our approach
is capable of (i) serving as a strong initialization for downstream recognition
tasks, (ii) conducting high-quality image inpainting, and (iii) being
effortlessly extended to video where it produces state-of-the-art
classification accuracy. We further perform a comprehensive study on the pros
and cons of design choices and build connections between diffusion models and
masked autoencoders.
Related papers
- DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater details and more coherent structures.
We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z) - ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z) - Denoising Autoregressive Representation Learning [13.185567468951628]
Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively.
We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models.
arXiv Detail & Related papers (2024-03-08T10:19:00Z) - Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z) - SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z) - InfoDiffusion: Representation Learning Using Information Maximizing
Diffusion Models [35.566528358691336]
InfoDiffusion is an algorithm that augments diffusion models with low-dimensional latent variables.
InfoDiffusion relies on a learning objective regularized with the mutual information between observed and hidden variables.
We find that InfoDiffusion learns disentangled and human-interpretable latent representations that are competitive with state-of-the-art generative and contrastive methods.
arXiv Detail & Related papers (2023-06-14T21:48:38Z) - Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet.
arXiv Detail & Related papers (2023-03-17T04:20:47Z) - Diffusion Models in Vision: A Survey [73.10116197883303]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.