ConvMAE: Masked Convolution Meets Masked Autoencoders
- URL: http://arxiv.org/abs/2205.03892v1
- Date: Sun, 8 May 2022 15:12:19 GMT
- Title: ConvMAE: Masked Convolution Meets Masked Autoencoders
- Authors: Peng Gao, Teli Ma, Hongsheng Li, Jifeng Dai, Yu Qiao
- Abstract summary: Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer architectures can learn more discriminative representations via the masked auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
- Score: 65.15953258300958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have become widely adopted architectures for various
vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid
convolution-transformer architectures can further unleash the potential of ViTs,
leading to state-of-the-art performance on image classification, detection and
semantic segmentation. In this paper, our ConvMAE framework demonstrates that
multi-scale hybrid convolution-transformer architectures can learn more discriminative
representations via the masked auto-encoding scheme. However, directly using the
original masking strategy leads to heavy computational cost and a
pretraining-finetuning discrepancy. To tackle this issue, we adopt masked convolution
to prevent information leakage in the convolution blocks. A simple block-wise masking
strategy is proposed to ensure computational efficiency. We also propose to supervise
the multi-scale features of the encoder more directly, which strengthens its multi-scale
representations. Based on our pretrained ConvMAE models, ConvMAE-Base improves
ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection,
ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base fine-tuned for 100 epochs
by 2.9% box AP and 2.2% mask AP, respectively.
respectively. Code and pretrained models are available at
https://github.com/Alpha-VL/ConvMAE.
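The masked-convolution and block-wise-masking ideas described in the abstract above can be made concrete with a short sketch. The PyTorch snippet below is not the official implementation (see the repository linked above for that): the 14x14 coarse grid, the 25% keep ratio, and the depthwise convolution block are illustrative assumptions. The mask is sampled once at the coarsest (transformer) resolution and upsampled to the finer convolutional stages, and features are multiplied by the mask before and after the convolution so that masked regions cannot leak information into the visible tokens.

```python
# Hedged sketch of ConvMAE-style block-wise masking and masked convolution.
# Not the official code: grid size, keep ratio, and the depthwise conv block
# are assumptions chosen only to illustrate the mechanism.
import torch
import torch.nn.functional as F
from torch import nn

def blockwise_mask(batch, coarse_hw=14, keep_ratio=0.25, device="cpu"):
    """Sample one mask at the coarsest (transformer) resolution and
    upsample it to the finer convolutional stages (1/4 and 1/8)."""
    num = coarse_hw * coarse_hw
    keep = int(num * keep_ratio)
    noise = torch.rand(batch, num, device=device)
    ids = noise.argsort(dim=1)                        # random permutation per sample
    mask = torch.zeros(batch, num, device=device)
    mask.scatter_(1, ids[:, :keep], 1.0)              # 1 = visible, 0 = masked
    mask = mask.view(batch, 1, coarse_hw, coarse_hw)
    mask_s1 = F.interpolate(mask, scale_factor=4.0, mode="nearest")  # stage-1 mask (1/4 res)
    mask_s2 = F.interpolate(mask, scale_factor=2.0, mode="nearest")  # stage-2 mask (1/8 res)
    return mask, mask_s1, mask_s2

class MaskedConvBlock(nn.Module):
    """Depthwise conv whose input and output are multiplied by the mask,
    so masked positions neither contribute to nor receive features."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, mask):                       # x: (B, C, H, W), mask: (B, 1, H, W)
        x = x * mask                                  # hide masked regions before the conv
        x = self.conv(x)
        return x * mask                               # re-mask so nothing leaks back in
```

Because the same mask is shared across scales, the convolutional stages see exactly the patches that the transformer stage keeps, which is, per the abstract, what keeps the block-wise strategy computationally efficient.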
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior.
arXiv Detail & Related papers (2024-04-10T06:30:08Z) - Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic Segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces FLOPs by up to 4x compared to the per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z) - Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature
Mimicking [35.11620617064127]
Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training.
We propose MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training.
On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning.
arXiv Detail & Related papers (2023-03-09T18:28:18Z) - Semi-MAE: Masked Autoencoders for Semi-supervised Vision Transformers [5.29690621203603]
Semi-MAE is a pure ViT-based SSL framework consisting of a parallel MAE branch to assist the visual representation learning.
Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing prior state-of-the-art in semi-supervised image classification.
arXiv Detail & Related papers (2023-01-04T03:59:17Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT trims 52.0% of the FLOPs of DeiT-B while simultaneously gaining 0.6% top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively; a minimal sketch of this masking-and-reconstruction recipe follows the related-papers list below.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
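As a companion to the "Masked Autoencoders Are Scalable Vision Learners" entry above, here is a minimal sketch of the recipe that its summary describes: mask random patches, encode only the visible ones, and reconstruct the missing pixels. It is an illustration under assumed hyperparameters (16x16 patches, a 75% mask ratio, a tiny two-layer encoder and one-layer decoder, no positional embeddings), not the paper's implementation.

```python
# Hedged sketch of the MAE recipe: random patch masking, visible-only encoding,
# and a pixel-reconstruction loss computed on the masked patches.
# All module sizes and the patch/mask settings are illustrative assumptions.
import torch
from torch import nn

P, MASK_RATIO = 16, 0.75          # patch size and mask ratio (assumed defaults)

class TinyMAE(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Linear(P * P * 3, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, P * P * 3)          # predict raw pixels per patch

    def forward(self, imgs):                            # imgs: (B, 3, H, W)
        B = imgs.shape[0]
        x = imgs.unfold(2, P, P).unfold(3, P, P)        # patchify
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, P * P * 3)
        N = x.shape[1]
        keep = int(N * (1 - MASK_RATIO))
        ids = torch.rand(B, N, device=imgs.device).argsort(1)   # random shuffle per image
        restore = ids.argsort(1)                                 # inverse permutation
        vis = torch.gather(x, 1, ids[:, :keep, None].expand(-1, -1, x.shape[-1]))
        latent = self.encoder(self.embed(vis))                   # encode visible patches only
        # append mask tokens, restore original patch order, decode, predict pixels
        dec_in = torch.cat([latent, self.mask_token.expand(B, N - keep, -1)], dim=1)
        dec_in = torch.gather(dec_in, 1, restore[..., None].expand(-1, -1, dec_in.shape[-1]))
        pred = self.head(self.decoder(dec_in))
        masked = torch.zeros(B, N, device=imgs.device)
        masked.scatter_(1, ids[:, keep:], 1.0)                   # 1 = masked patch
        return (((pred - x) ** 2).mean(-1) * masked).sum() / masked.sum()
```

For example, `TinyMAE()(torch.randn(2, 3, 224, 224))` returns a mean squared error computed only on the masked patches, which is the "reconstruct the missing pixels" objective the entry refers to.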
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.