MST: Masked Self-Supervised Transformer for Visual Representation
- URL: http://arxiv.org/abs/2106.05656v1
- Date: Thu, 10 Jun 2021 11:05:18 GMT
- Title: MST: Masked Self-Supervised Transformer for Visual Representation
- Authors: Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang
Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, Jinqiao Wang
- Abstract summary: Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP).
We present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image.
MST achieves 76.9% Top-1 accuracy with DeiT-S under linear evaluation after only 300 epochs of pre-training.
- Score: 52.099722121603506
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer has been widely used for self-supervised pre-training in Natural
Language Processing (NLP) and has achieved great success. However, it has not
been fully explored in visual self-supervised learning. Moreover, previous
methods only consider high-level features and learn representations from a
global perspective, which may fail to transfer to downstream dense prediction
tasks that focus on local features. In this paper, we present a novel Masked
Self-supervised Transformer approach named MST, which can explicitly capture
the local context of an image while preserving the global semantic information.
Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose
a masked token strategy based on the multi-head self-attention map, which
dynamically masks some tokens of local patches without damaging the crucial
structure for self-supervised learning. More importantly, the masked tokens
together with the remaining tokens are further recovered by a global image
decoder, which preserves the spatial information of the image and is
friendlier to downstream dense prediction tasks. Experiments on multiple
datasets demonstrate the effectiveness and generality of the proposed method.
For instance, MST achieves 76.9% Top-1 accuracy with DeiT-S under linear
evaluation using only 300 epochs of pre-training, outperforming supervised
training with the same schedule by 0.4% and the comparable method DINO by
1.0%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO
object detection and 74.04% mIoU on Cityscapes segmentation with only
100 epochs of pre-training.
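The abstract names two components: an attention-guided token masking strategy and a global image decoder that recovers the full token grid. Below is a minimal PyTorch sketch of how such a scheme could be wired up; the mask ratio, the choice to mask the least-attended tokens, and all module shapes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of attention-guided masking + a global image decoder in the
# spirit of MST. All shapes, the mask ratio, and module sizes are assumptions
# for illustration, not the paper's actual code.
import torch
import torch.nn as nn

def attention_guided_mask(cls_attn, mask_ratio=0.3):
    """Pick patch tokens to mask, preferring LOW-attention patches so the
    crucial (high-attention) image structure is preserved.

    cls_attn: (B, H, N) attention from the class token to N patch tokens,
              one map per head H. Returns a (B, N) boolean mask (True = masked).
    """
    scores = cls_attn.mean(dim=1)               # (B, N): average over heads
    num_mask = int(scores.shape[1] * mask_ratio)
    idx = scores.argsort(dim=1)[:, :num_mask]   # least-attended patches first
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask

class GlobalImageDecoder(nn.Module):
    """Toy stand-in for a global image decoder: masked and visible tokens are
    decoded jointly over the full spatial token grid, which keeps per-patch
    spatial information (friendlier to dense prediction)."""
    def __init__(self, dim=384, patch=16, in_chans=3):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.to_pixels = nn.Linear(dim, patch * patch * in_chans)

    def forward(self, tokens, mask):
        # tokens: (B, N, dim) encoder patch tokens; mask: (B, N) boolean
        x = torch.where(mask.unsqueeze(-1), self.mask_token, tokens)
        x = self.blocks(x)            # global decoding over ALL tokens
        return self.to_pixels(x)      # (B, N, patch*patch*3) pixel predictions

# Shape check with DeiT-S-like sizes: 6 heads, 14x14 = 196 patches, dim 384.
attn = torch.rand(2, 6, 196)
mask = attention_guided_mask(attn)                         # (2, 196)
recon = GlobalImageDecoder()(torch.rand(2, 196, 384), mask)
```

Masking by inverse attention, rather than uniformly at random as in many MIM methods, is presumably what the abstract means by "without damaging the crucial structure"; a reconstruction loss over the decoded patches would then drive pre-training.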
Related papers
- Symmetric masking strategy enhances the performance of Masked Image Modeling [0.0]
Masked Image Modeling (MIM) is a technique in self-supervised learning that focuses on acquiring detailed visual representations from unlabeled images.
We propose a new masking strategy that effectively helps the model capture global and local features.
Based on this masking strategy, we introduce SymMIM, our proposed training pipeline for MIM.
arXiv Detail & Related papers (2024-08-23T00:15:43Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the data-hungry nature of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Multi-Level Contrastive Learning for Dense Prediction Task [59.591755258395594]
We present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representation for dense prediction tasks.
Our method is motivated by the three key factors in detection: localization, scale consistency and recognition.
Our method consistently outperforms the recent state-of-the-art methods on various datasets with significant margins.
arXiv Detail & Related papers (2023-04-04T17:59:04Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders [44.87786478095987]
Masked Autoencoders learn general representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data.
This paper proposes an adaptive masking strategy for MAEs that is end-to-end trainable.
AdaMAE samples visible tokens based on the semantic context using an auxiliary sampling network.
arXiv Detail & Related papers (2022-11-16T18:59:48Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Masked Autoencoders for Point Cloud Self-supervised Learning [27.894216954216716]
We propose a neat scheme of masked autoencoders for point cloud self-supervised learning.
We divide the input point cloud into irregular point patches and randomly mask them at a high ratio (see the masking sketch after this list).
A standard Transformer based autoencoder, with an asymmetric design and a shifting mask tokens operation, learns high-level latent features from unmasked point patches.
arXiv Detail & Related papers (2022-03-13T09:23:39Z)
- Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision [106.77639982059014]
We present the ConST-CL framework to effectively learn spatio-temporally fine-grained representations.
We first design a region-based self-supervised task which requires the model to learn to transform instance representations from one view to another guided by context features.
We then introduce a simple design that effectively reconciles the simultaneous learning of both holistic and local representations.
arXiv Detail & Related papers (2021-12-09T19:13:41Z)
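As referenced in the point-cloud masked-autoencoder entry above, the following is a minimal sketch of its high-ratio random masking over point patches. The 60% ratio and tensor shapes are assumptions, and the construction of the patches themselves (typically farthest point sampling plus nearest-neighbor grouping) is omitted.

```python
# Minimal sketch of high-ratio random masking over point patches, as described
# in the point-cloud MAE entry above. Ratio and shapes are assumptions.
import torch

def random_mask_point_patches(patches, mask_ratio=0.6):
    """patches: (B, G, K, 3) - B clouds, G patches of K xyz points each.
    Returns the visible patches and a (B, G) boolean mask (True = masked)."""
    B, G = patches.shape[:2]
    num_mask = int(G * mask_ratio)
    noise = torch.rand(B, G)              # one random score per patch
    idx = noise.argsort(dim=1)            # random permutation of patch indices
    mask = torch.zeros(B, G, dtype=torch.bool)
    mask.scatter_(1, idx[:, :num_mask], True)
    visible = patches[~mask].reshape(B, G - num_mask, *patches.shape[2:])
    return visible, mask

# Example: 2 clouds, 64 patches of 32 points -> 26 visible patches each.
vis, m = random_mask_point_patches(torch.rand(2, 64, 32, 3))
```

Under the asymmetric design that entry mentions, only the visible patches would go through the encoder, with mask tokens handed to a lightweight decoder for reconstruction.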
This list is automatically generated from the titles and abstracts of the papers on this site.