ConvMAE: Masked Convolution Meets Masked Autoencoders
- URL: http://arxiv.org/abs/2205.03892v1
- Date: Sun, 8 May 2022 15:12:19 GMT
- Title: ConvMAE: Masked Convolution Meets Masked Autoencoders
- Authors: Peng Gao, Teli Ma, Hongsheng Li, Jifeng Dai, Yu Qiao
- Abstract summary: Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer architectures can learn more discriminative representations via the masked auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
- Score: 65.15953258300958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have become widely adopted architectures for various
vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid
convolution-transformer architectures can further unleash the potential of ViTs,
leading to state-of-the-art performance on image classification, detection and
semantic segmentation. In this paper, our ConvMAE framework demonstrates that
multi-scale hybrid convolution-transformer architectures can learn more discriminative
representations via the masked auto-encoding scheme. However, directly using the
original masking strategy leads to heavy computational cost and a
pretraining-finetuning discrepancy. To tackle this issue, we adopt masked convolution
to prevent information leakage in the convolution blocks. A simple block-wise masking
strategy is proposed to ensure computational efficiency. We also propose to supervise
the multi-scale features of the encoder more directly, which strengthens its multi-scale
representations. Based on our pretrained ConvMAE models, ConvMAE-Base improves
ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection,
ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base fine-tuned for 100 epochs
by 2.9% box AP and 2.2% mask AP, respectively.
respectively. Code and pretrained models are available at
https://github.com/Alpha-VL/ConvMAE.
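The masked-convolution and block-wise-masking ideas described in the abstract above can be made concrete with a short sketch. The PyTorch snippet below is not the official implementation (see the repository linked above for that): the 14x14 coarse grid, the 25% keep ratio, and the depthwise convolution block are illustrative assumptions. The mask is sampled once at the coarsest (transformer) resolution and upsampled to the finer convolutional stages, and features are multiplied by the mask before and after the convolution so that masked regions cannot leak information into the visible tokens.

```python
# Hedged sketch of ConvMAE-style block-wise masking and masked convolution.
# Not the official code: grid size, keep ratio, and the depthwise conv block
# are assumptions chosen only to illustrate the mechanism.
import torch
import torch.nn.functional as F
from torch import nn

def blockwise_mask(batch, coarse_hw=14, keep_ratio=0.25, device="cpu"):
    """Sample one mask at the coarsest (transformer) resolution and
    upsample it to the finer convolutional stages (1/4 and 1/8)."""
    num = coarse_hw * coarse_hw
    keep = int(num * keep_ratio)
    noise = torch.rand(batch, num, device=device)
    ids = noise.argsort(dim=1)                        # random permutation per sample
    mask = torch.zeros(batch, num, device=device)
    mask.scatter_(1, ids[:, :keep], 1.0)              # 1 = visible, 0 = masked
    mask = mask.view(batch, 1, coarse_hw, coarse_hw)
    mask_s1 = F.interpolate(mask, scale_factor=4.0, mode="nearest")  # stage-1 mask (1/4 res)
    mask_s2 = F.interpolate(mask, scale_factor=2.0, mode="nearest")  # stage-2 mask (1/8 res)
    return mask, mask_s1, mask_s2

class MaskedConvBlock(nn.Module):
    """Depthwise conv whose input and output are multiplied by the mask,
    so masked positions neither contribute to nor receive features."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, mask):                       # x: (B, C, H, W), mask: (B, 1, H, W)
        x = x * mask                                  # hide masked regions before the conv
        x = self.conv(x)
        return x * mask                               # re-mask so nothing leaks back in
```

Because the same mask is shared across scales, the convolutional stages see exactly the patches that the transformer stage keeps, which is, per the abstract, what keeps the block-wise strategy computationally efficient.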
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior.
arXiv Detail & Related papers (2024-04-10T06:30:08Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature
Mimicking [35.11620617064127]
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training.
We propose MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training.
On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning.
arXiv Detail & Related papers (2023-03-09T18:28:18Z) - Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers [5.29690621203603]
Semi-MAE is a pure ViT-based SSL framework consisting of a parallel MAE branch to assist the visual representation learning.
Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing prior state-of-the-art in semi-supervised image classification.
arXiv Detail & Related papers (2023-01-04T03:59:17Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT trims 52.0% of the FLOPs of DeiT-B while simultaneously gaining 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively; a minimal sketch of this masking-and-reconstruction recipe follows the related-papers list below.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
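As a companion to the "Masked Autoencoders Are Scalable Vision Learners" entry above, here is a minimal sketch of the recipe that its summary describes: mask random patches, encode only the visible ones, and reconstruct the missing pixels. It is an illustration under assumed hyperparameters (16x16 patches, a 75% mask ratio, a tiny two-layer encoder and one-layer decoder, no positional embeddings), not the paper's implementation.

```python
# Hedged sketch of the MAE recipe: random patch masking, visible-only encoding,
# and a pixel-reconstruction loss computed on the masked patches.
# All module sizes and the patch/mask settings are illustrative assumptions.
import torch
from torch import nn

P, MASK_RATIO = 16, 0.75          # patch size and mask ratio (assumed defaults)

class TinyMAE(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Linear(P * P * 3, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, P * P * 3)          # predict raw pixels per patch

    def forward(self, imgs):                            # imgs: (B, 3, H, W)
        B = imgs.shape[0]
        x = imgs.unfold(2, P, P).unfold(3, P, P)        # patchify
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, P * P * 3)
        N = x.shape[1]
        keep = int(N * (1 - MASK_RATIO))
        ids = torch.rand(B, N, device=imgs.device).argsort(1)   # random shuffle per image
        restore = ids.argsort(1)                                 # inverse permutation
        vis = torch.gather(x, 1, ids[:, :keep, None].expand(-1, -1, x.shape[-1]))
        latent = self.encoder(self.embed(vis))                   # encode visible patches only
        # append mask tokens, restore original patch order, decode, predict pixels
        dec_in = torch.cat([latent, self.mask_token.expand(B, N - keep, -1)], dim=1)
        dec_in = torch.gather(dec_in, 1, restore[..., None].expand(-1, -1, dec_in.shape[-1]))
        pred = self.head(self.decoder(dec_in))
        masked = torch.zeros(B, N, device=imgs.device)
        masked.scatter_(1, ids[:, keep:], 1.0)                   # 1 = masked patch
        return (((pred - x) ** 2).mean(-1) * masked).sum() / masked.sum()
```

For example, `TinyMAE()(torch.randn(2, 3, 224, 224))` returns a mean squared error computed only on the masked patches, which is the "reconstruct the missing pixels" objective the entry refers to.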
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.