Denoising Masked AutoEncoders are Certifiable Robust Vision Learners
- URL: http://arxiv.org/abs/2210.06983v1
- Date: Mon, 10 Oct 2022 12:37:59 GMT
- Title: Denoising Masked AutoEncoders are Certifiable Robust Vision Learners
- Authors: Quanlin Wu, Hang Ye, Yuntian Gu, Huishuai Zhang, Liwei Wang, Di He
- Abstract summary: We propose a new self-supervised method, called Denoising Masked AutoEncoders (DMAE).
DMAE corrupts each image by adding Gaussian noise to each pixel value and randomly masking several patches.
A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one.
- Score: 37.04863068273281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new self-supervised method, called
Denoising Masked AutoEncoders (DMAE), for learning certified robust classifiers
of images. In DMAE, we corrupt each image by adding Gaussian noise to each
pixel value and randomly masking several patches. A Transformer-based
encoder-decoder model is then trained to reconstruct the original image from
the corrupted one. In this learning paradigm, the encoder learns to capture
semantics relevant to downstream tasks while remaining robust to additive
Gaussian noise. We show that the pre-trained encoder can naturally be used as
the base classifier in Gaussian smoothed models, for which the certified radius
of any data point can be computed analytically. Although the proposed method is
simple, it yields significant performance improvements in downstream
classification tasks. We show that the DMAE ViT-Base model, which uses only
1/10 of the parameters of the model developed in recent work arXiv:2206.10550,
achieves competitive or better certified accuracy in various settings. The DMAE
ViT-Large model significantly surpasses all previous results, establishing a
new state of the art on the ImageNet dataset. We further demonstrate that the
pre-trained model transfers well to the CIFAR-10 dataset, suggesting its wide
adaptability. Models and code are available at
https://github.com/quanlin-wu/dmae.
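The abstract describes two pieces that are simple enough to sketch: the DMAE-style corruption (pixel-wise Gaussian noise plus random patch masking) and the analytic certified radius available to Gaussian smoothed classifiers. The snippet below is a minimal illustration, not the authors' implementation: the noise level, masking ratio, and patch size are placeholder values, masked patches are zeroed rather than dropped from the encoder input as in MAE-style models, and the radius formula is the standard randomized-smoothing bound of Cohen et al. (2019), with which Gaussian smoothed classifiers are commonly certified.

```python
import numpy as np
from scipy.stats import norm

def dmae_corrupt(image, sigma=0.25, mask_ratio=0.75, patch_size=16, rng=None):
    """DMAE-style corruption: add pixel-wise Gaussian noise, then randomly
    mask a fraction of non-overlapping patches.

    `image` is an (H, W, C) float array in [0, 1]; sigma, mask_ratio and
    patch_size are illustrative placeholders, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + sigma * rng.standard_normal(image.shape)

    h, w, _ = image.shape
    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    masked_ids = rng.choice(n_patches, size=int(mask_ratio * n_patches),
                            replace=False)

    corrupted = noisy.copy()
    for pid in masked_ids:
        r, c = divmod(pid, grid_w)
        # Zeroing is a simplification; MAE-style encoders usually drop
        # masked patches from the input sequence instead.
        corrupted[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size, :] = 0.0
    return corrupted, masked_ids

def certified_radius(p_a_lower, sigma):
    """l2 certified radius of a Gaussian smoothed classifier (Cohen et al.,
    2019): R = sigma * Phi^{-1}(p_A), where p_A is a lower confidence bound
    on the top-class probability under Gaussian noise with std sigma."""
    if p_a_lower <= 0.5:
        return 0.0  # abstain: no certificate can be issued
    return sigma * norm.ppf(p_a_lower)
```

In practice, p_a_lower would be estimated with a Monte Carlo confidence bound (e.g., Clopper-Pearson) over many noisy copies of the input, with the pre-trained encoder plus a classification head serving as the base classifier.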
Related papers
- Unified Auto-Encoding with Masked Diffusion [15.264296748357157]
We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD)
UMD combines patch-based and noise-based corruption techniques within a single auto-encoding framework.
It achieves strong performance in downstream generative and representation learning tasks.
arXiv Detail & Related papers (2024-06-25T16:24:34Z)
- Denoising Autoregressive Representation Learning [13.185567468951628]
Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively.
We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models.
arXiv Detail & Related papers (2024-03-08T10:19:00Z)
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal architectural changes and no inference overhead, this setup improves the quality of the learned representations.
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Masked Images Are Counterfactual Samples for Robust Fine-tuning [77.82348472169335]
Fine-tuning deep learning models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness.
We propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuned model.
arXiv Detail & Related papers (2023-03-06T11:51:28Z)
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
SdAE, a self-distillated masked autoencoder network, is proposed in this paper.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder architecture with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
arXiv Detail & Related papers (2020-06-22T17:59:07Z)