Masked Autoencoders as Image Processors
- URL: http://arxiv.org/abs/2303.17316v1
- Date: Thu, 30 Mar 2023 12:09:35 GMT
- Title: Masked Autoencoders as Image Processors
- Authors: Huiyu Duan, Wei Shen, Xiongkuo Min, Danyang Tu, Long Teng, Jia Wang, Guangtao Zhai
- Abstract summary: Masked autoencoders (MAE) for feature pre-training have unleashed the potential of Transformers.
In this paper, we show that masked autoencoders are also scalable self-supervised learners for image processing tasks.
- Score: 35.531254533198165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown significant effectiveness for various vision tasks
including both high-level vision and low-level vision. Recently, masked
autoencoders (MAE) for feature pre-training have further unleashed the
potential of Transformers, leading to state-of-the-art performances on various
high-level vision tasks. However, the significance of MAE pre-training on
low-level vision tasks has not been sufficiently explored. In this paper, we
show that masked autoencoders are also scalable self-supervised learners for
image processing tasks. We first present an efficient Transformer model
considering both channel attention and shifted-window-based self-attention
termed CSformer. Then we develop an effective MAE architecture for image
processing (MAEIP) tasks. Extensive experimental results show that with the
help of MAEIP pre-training, our proposed CSformer achieves state-of-the-art
performance on various image processing tasks, including Gaussian denoising,
real image denoising, single-image motion deblurring, defocus deblurring, and
image deraining.
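The abstract names CSformer's two ingredients, channel attention and shifted-window self-attention, without detailing how they are wired together. Purely as a rough PyTorch sketch (not the paper's implementation: the squeeze-and-excitation-style channel branch, module names, and dimensions are assumptions, and the window-shifting step is omitted), one way to pair the two in a single block:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention (illustrative only)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.GELU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )

    def forward(self, x):                 # x: (B, N, C) patch tokens
        w = self.mlp(x.mean(dim=1))       # pool over tokens -> (B, C)
        return x * w.unsqueeze(1)         # reweight channels

class WindowAttention(nn.Module):
    """Multi-head self-attention inside non-overlapping windows
    (the shift step of shifted-window attention is omitted for brevity)."""
    def __init__(self, dim, window, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, H, W):           # x: (B, H*W, C); H, W divisible by window
        B, N, C = x.shape
        w = self.window
        x = x.view(B, H // w, w, W // w, w, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        x, _ = self.attn(x, x, x)         # attention within each window
        x = x.view(B, H // w, W // w, w, w, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return x

class CSBlock(nn.Module):
    """Hypothetical block pairing the two attention types with residuals."""
    def __init__(self, dim, window=8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ca = ChannelAttention(dim)
        self.wa = WindowAttention(dim, window)

    def forward(self, x, H, W):
        x = x + self.ca(self.norm1(x))
        x = x + self.wa(self.norm2(x), H, W)
        return x

# e.g. CSBlock(96)(torch.randn(1, 64 * 64, 96), 64, 64)
```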
Related papers
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
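How gradient-aware adaptive sampling works is specified in the paper itself; as a loose illustration of the idea only (the function, its weighting scheme, and all parameters below are assumptions, not VistaLLM's procedure), one could sample mask-boundary points with probability proportional to local gradient magnitude, so the point sequence spends its budget where the contour is:

```python
import numpy as np

def adaptive_mask_points(mask: np.ndarray, n_points: int = 32) -> np.ndarray:
    """Serialize a binary mask as a point sequence, sampling where the
    mask changes fastest (illustrative stand-in, not VistaLLM's method)."""
    gy, gx = np.gradient(mask.astype(np.float32))
    edge = np.hypot(gx, gy)                      # boundary strength
    ys, xs = np.nonzero(edge > 0)                # candidate boundary pixels
    if len(ys) == 0:
        return np.zeros((n_points, 2), dtype=np.float32)
    probs = edge[ys, xs] / edge[ys, xs].sum()    # gradient-weighted sampling
    idx = np.random.choice(len(ys), size=n_points,
                           replace=len(ys) < n_points, p=probs)
    pts = np.stack([xs[idx], ys[idx]], axis=1).astype(np.float32)
    pts[:, 0] /= mask.shape[1]                   # normalize to [0, 1] so the
    pts[:, 1] /= mask.shape[0]                   # points serialize compactly
    return pts
```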
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Feature Guided Masked Autoencoder for Self-supervised Learning in Remote Sensing [16.683132793313693]
Masked AutoEncoder (MAE) has attracted wide attention for pretraining vision transformers in remote sensing.
We propose the Feature Guided Masked Autoencoder (FG-MAE), which reconstructs a combination of Histograms of Oriented Gradients (HOG) and Normalized Difference Indices (NDI) for multispectral images, and HOG alone for SAR images.
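Normalized difference indices are cheap per-pixel targets; a minimal NumPy sketch (the band layout and the NDVI pairing below are illustrative assumptions, not FG-MAE's exact choices):

```python
import numpy as np

def ndi(band_a: np.ndarray, band_b: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Index; NDVI when band_a=NIR and band_b=Red."""
    return (band_a - band_b) / (band_a + band_b + eps)

# Hypothetical 13-band multispectral patch; band positions depend on the sensor.
patch = np.random.rand(13, 16, 16)
ndvi_target = ndi(patch[7], patch[3])   # per-pixel reconstruction target in [-1, 1]
```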
arXiv Detail & Related papers (2023-10-28T09:43:13Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
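A minimal sketch of the reuse idea, assuming a single learned linear map as the cheap surrogate (SkipAt's actual parametric function is more elaborate; the names and widths here are assumptions):

```python
import torch
import torch.nn as nn

class SkipAttnBlock(nn.Module):
    """Transformer block that can reuse a preceding block's attention output
    instead of recomputing self-attention (sketch of the SkipAt idea)."""
    def __init__(self, dim, heads=4, skip=False):
        super().__init__()
        self.skip = skip
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        if skip:
            self.approx = nn.Linear(dim, dim)   # cheap surrogate for attention
        else:
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x, cached_attn=None):
        if self.skip:
            a = self.approx(cached_attn)        # reuse, don't recompute
        else:
            h = self.norm1(x)
            a, _ = self.attn(h, h, h)
        x = x + a
        x = x + self.mlp(self.norm2(x))
        return x, a                             # expose features for reuse

# blk1 = SkipAttnBlock(96); blk2 = SkipAttnBlock(96, skip=True)
# x, a = blk1(torch.randn(2, 196, 96)); x, _ = blk2(x, cached_attn=a)
```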
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- Masked Image Modeling with Denoising Contrast [30.31920660487222]
Masked image modeling dominates this line of research with state-of-the-art performance on vision Transformers.
We introduce a new pre-training method, ConMIM, to produce simple intra-image inter-patch contrastive constraints.
ConMIM-pretrained vision Transformers with various scales achieve promising results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks.
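One plausible reading of "intra-image inter-patch contrastive constraints" is an InfoNCE loss in which each predicted patch embedding must match its own target patch, with the same image's other patches as negatives; the sketch below follows that reading and is not ConMIM's exact objective:

```python
import torch
import torch.nn.functional as F

def intra_image_patch_nce(pred: torch.Tensor, target: torch.Tensor,
                          tau: float = 0.1) -> torch.Tensor:
    """InfoNCE over the patches of one image: positives on the diagonal,
    the image's other patches as negatives (illustrative sketch)."""
    pred = F.normalize(pred, dim=-1)      # (N, D) predicted patch features
    target = F.normalize(target, dim=-1)  # (N, D) target patch features
    logits = pred @ target.t() / tau      # (N, N) patch-to-patch similarity
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)
```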
arXiv Detail & Related papers (2022-05-19T15:22:29Z)
- The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
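A minimal sketch of a joint pixel/frequency reconstruction loss (a simplification: Ge$^2$-AE uses two geminated decoders, whereas here a single output is compared in both spaces, and the loss weighting is an assumption):

```python
import torch
import torch.nn.functional as F

def pixel_and_frequency_loss(recon: torch.Tensor, target: torch.Tensor,
                             alpha: float = 1.0) -> torch.Tensor:
    """Pixel-space MSE plus an L1 penalty on the 2-D FFT spectrum."""
    pix = F.mse_loss(recon, target)
    fr = torch.fft.rfft2(recon, norm="ortho")    # complex spectra
    ft = torch.fft.rfft2(target, norm="ortho")
    freq = (fr - ft).abs().mean()                # magnitude of the difference
    return pix + alpha * freq
```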
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
- On Efficient Transformer and Image Pre-training for Low-level Vision [74.22436001426517]
Pre-training has produced numerous state-of-the-art results in high-level computer vision.
We present an in-depth study of image pre-training.
We find pre-training plays strikingly different roles in low-level tasks.
arXiv Detail & Related papers (2021-12-19T15:50:48Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs, an asymmetric encoder-decoder architecture and a high masking ratio, enables us to train large models efficiently and effectively.
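The masking step is easy to reproduce; the sketch below follows the common shuffle-and-truncate implementation of MAE-style random masking (a paraphrase, not the official code):

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens per sample; the encoder then
    sees only the visible tokens, which is what makes MAE cheap to train."""
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)  # per-patch random scores
    ids_shuffle = noise.argsort(dim=1)              # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)        # inverse, for the decoder
    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1,
                           ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)   # 1 marks masked patches
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask, ids_restore
```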
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds on the observation that convolutions, fully connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
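A sketch of the resulting recipe, where early stages drop token mixing entirely and only later stages keep self-attention (norm placement and block details are simplified assumptions):

```python
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return x + self.fn(x)

def make_stage(dim: int, depth: int, use_attention: bool) -> nn.Sequential:
    """Early stages: per-token MLP blocks only; later stages: full attention."""
    blocks = []
    for _ in range(depth):
        if use_attention:
            blocks.append(nn.TransformerEncoderLayer(
                d_model=dim, nhead=4, dim_feedforward=4 * dim,
                batch_first=True))
        else:
            blocks.append(Residual(nn.Sequential(
                nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                nn.GELU(), nn.Linear(4 * dim, dim))))
    return nn.Sequential(*blocks)
```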
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely, the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple task-specific heads and tails.
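Synthesizing corrupted pairs from clean images is straightforward; the corruption types and parameters below are illustrative stand-ins, not IPT's exact synthesis pipeline:

```python
import torch
import torch.nn.functional as F

def make_corrupted_input(clean: torch.Tensor, task: str) -> torch.Tensor:
    """Derive a corrupted input from a clean (B, C, H, W) image so that
    (corrupted, clean) pairs can supervise a restoration model."""
    if task == "denoise":
        return clean + 0.1 * torch.randn_like(clean)    # additive Gaussian noise
    if task == "super_resolve":
        return F.interpolate(clean, scale_factor=0.5,   # 2x downscaling
                             mode="bicubic", align_corners=False)
    raise ValueError(f"unknown task: {task}")
```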
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.