Self-Supervised Learning with Swin Transformers
- URL: http://arxiv.org/abs/2105.04553v2
- Date: Tue, 11 May 2021 17:28:00 GMT
- Title: Self-Supervised Learning with Swin Transformers
- Authors: Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao,
Han Hu
- Abstract summary: We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach contains essentially no new inventions; it combines MoCo v2 and BYOL.
The performance is slightly better than that of the recent MoCo v3 and DINO, which adopt DeiT as the backbone, while using much lighter tricks.
- Score: 24.956637957269926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We are witnessing a modeling shift from CNNs to Transformers in computer
vision. In this work, we present a self-supervised learning approach called
MoBY, with Vision Transformers as its backbone architecture. The approach
contains essentially no new inventions: it combines MoCo v2 and BYOL and is
tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation
(72.8% and 75.0% top-1 accuracy using DeiT-S and Swin-T, respectively, with
300-epoch training). The performance is slightly better than that of the recent
MoCo v3 and DINO, which adopt DeiT as the backbone, while using much lighter
tricks.
More importantly, the general-purpose Swin Transformer backbone enables us to
also evaluate the learnt representations on downstream tasks such as object
detection and semantic segmentation, in contrast to a few recent approaches
built on ViT/DeiT, which only report linear evaluation results on ImageNet-1K
because ViT/DeiT have not been tamed for these dense prediction tasks. We hope our results
can facilitate more comprehensive evaluation of self-supervised learning
methods designed for Transformer architectures. Our code and models are
available at https://github.com/SwinTransformer/Transformer-SSL, which will be
continually enriched.
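To make the described recipe concrete, the following is a minimal sketch of a MoBY-style training step: an online encoder with a BYOL-style prediction head, a momentum-updated target encoder, and a MoCo-style key queue with an InfoNCE loss. All names (momentum_update, contrastive_loss, moby_step), the momentum coefficient, the temperature, and the queue handling are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(online, target, m=0.99):
    # BYOL-style exponential moving average of the online weights into the target.
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(m).add_(p_o.data, alpha=1.0 - m)

def contrastive_loss(q, k, queue, tau=0.2):
    # MoCo-style InfoNCE: the matching key is the positive, queued keys are negatives.
    # q, k: (N, D) projections; queue: (D, K) memory of past target keys.
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)      # (N, 1) positive logits
    l_neg = q @ queue.clone().detach()            # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

def moby_step(online, target, predictor, queue, view1, view2):
    # Online branch: backbone + projector, followed by a BYOL-style predictor.
    q1 = predictor(online(view1))
    q2 = predictor(online(view2))
    with torch.no_grad():                         # target branch gets no gradients
        k1 = target(view1)
        k2 = target(view2)
    # Symmetrize the loss over the two augmented views.
    return contrastive_loss(q1, k2, queue) + contrastive_loss(q2, k1, queue)
```

In a full training loop the queue would be updated with the target keys after each step and the momentum coefficient scheduled over training; those details are omitted here for brevity.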
Related papers
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing a multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive fields.
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z)
- Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block [0.0]
Transformer-based architectures are surpassing the state of the art set by CNN architectures in accuracy, but are computationally very expensive to train from scratch.
We study their transfer learning capabilities and compare them with CNNs, so that we can understand which architecture is better when applied to real-world problems with limited data.
We find that Transformer-based architectures not only achieve higher accuracy than CNNs, but some of them do so with roughly 4 times fewer parameters.
arXiv Detail & Related papers (2021-10-11T13:43:03Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- AutoFormer: Searching Transformers for Visual Recognition [97.60915598958968]
We propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search.
AutoFormer entangles the weights of different blocks in the same layers during supernet training.
We show that AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters.
arXiv Detail & Related papers (2021-07-01T17:59:30Z)
- Scaling Vision Transformers [82.08465256393514]
We study how Vision Transformers scale and characterize the relationships between error rate, data, and compute.
We train a ViT model with two billion parameters, which attains a new state-of-the-art on ImageNet of 90.45% top-1 accuracy.
The model also performs well on few-shot learning, for example, attaining 84.86% top-1 accuracy on ImageNet with only 10 examples per class.
arXiv Detail & Related papers (2021-06-08T17:47:39Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- SiT: Self-supervised vIsion Transformer [23.265568744478333]
In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be finetuned for a downstream classification task on small scale datasets.
arXiv Detail & Related papers (2021-04-08T08:34:04Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps and increase their diversity (a rough illustrative sketch follows this entry).
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
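As referenced above, here is a rough sketch of the Re-attention idea, under the assumption that it amounts to mixing the per-head attention maps with a learnable head-to-head matrix and renormalizing before weighting the values. The class name, the use of BatchNorm for the renormalization, and all hyperparameters are illustrative assumptions, not the DeepViT reference code.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Multi-head self-attention whose per-head attention maps are re-mixed
    across heads by a learnable matrix before being applied to the values."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.theta = nn.Parameter(torch.eye(num_heads))  # learnable head-mixing matrix
        self.norm = nn.BatchNorm2d(num_heads)            # renormalize the mixed maps
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)                      # standard per-head attention
        # Re-attention: mix the H attention maps across heads, then renormalize.
        attn = torch.einsum('hg,bgnm->bhnm', self.theta, attn)
        attn = self.norm(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```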