Abstract: Transformers have been widely used for self-supervised pre-training in Natural
Language Processing (NLP) and have achieved great success. However, they have not been
fully explored in visual self-supervised learning. Meanwhile, previous methods
only consider high-level features and learn representations from a global
perspective, which may fail to transfer to downstream dense prediction
tasks that focus on local features. In this paper, we present a novel Masked
Self-supervised Transformer approach named MST, which can explicitly capture
the local context of an image while preserving the global semantic information.
Specifically, inspired by Masked Language Modeling (MLM) in NLP, we propose
a masked token strategy based on the multi-head self-attention map, which
dynamically masks some tokens of local patches without damaging the structure
crucial for self-supervised learning. More importantly, the masked tokens
together with the remaining tokens are further recovered by a global image
decoder, which preserves the spatial information of the image and is more
friendly to downstream dense prediction tasks. Experiments on multiple
datasets demonstrate the effectiveness and generality of the proposed method.
For instance, under linear evaluation, MST achieves a Top-1 accuracy of 76.9% with DeiT-S
using only 300-epoch pre-training, which outperforms supervised
methods trained for the same number of epochs by 0.4% and its comparable variant DINO by 1.0%.
For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object
detection and 74.04% mIoU on Cityscapes segmentation with only 100-epoch pre-training.
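To make the attention-guided masking described in the abstract concrete, the following is a minimal sketch under our own assumptions (a hypothetical helper name attention_guided_mask, a fixed mask ratio, and masking only the patches that receive the least [CLS] attention); it is an illustration of the idea, not the authors' released implementation.

```python
import torch

def attention_guided_mask(patch_tokens, cls_attn, mask_token, mask_ratio=0.15):
    """Sketch of attention-guided token masking (hypothetical helper, not official MST code).

    patch_tokens: (B, N, D) patch embeddings fed to the ViT encoder.
    cls_attn:     (B, N) attention of the [CLS] token over patches, averaged over heads.
    mask_token:   (D,) learnable embedding that replaces masked patches.
    """
    B, N, D = patch_tokens.shape
    num_mask = int(N * mask_ratio)

    # Rank patches by how much attention the class token pays to them and
    # mask only the least-attended ones, so crucial structure is preserved.
    low_attn_idx = cls_attn.argsort(dim=1)[:, :num_mask]   # (B, num_mask)

    masked = patch_tokens.clone()
    batch_idx = torch.arange(B).unsqueeze(1)                # (B, 1), broadcasts with indices
    masked[batch_idx, low_attn_idx] = mask_token            # replace selected tokens
    return masked, low_attn_idx
```

In the full method, the masked sequence together with the remaining tokens would then be passed to a global image decoder that recovers the original image, which is what preserves spatial information for dense prediction tasks.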