SiT: Self-supervised vIsion Transformer
- URL: http://arxiv.org/abs/2104.03602v1
- Date: Thu, 8 Apr 2021 08:34:04 GMT
- Title: SiT: Self-supervised vIsion Transformer
- Authors: Sara Atito and Muhammad Awais and Josef Kittler
- Abstract summary: In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets.
- Score: 23.265568744478333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning methods are gaining increasing traction in computer vision due to their recent success in reducing the gap with supervised learning. In natural language processing (NLP), self-supervised learning and transformers are already the methods of choice. Recent literature suggests that transformers are also becoming increasingly popular in computer vision. So far, vision transformers have been shown to work well when pretrained either with large-scale supervised data or with some form of co-supervision, e.g. from a teacher network. These supervised pretrained vision transformers achieve very good results on downstream tasks with minimal changes. In this work we investigate the merits of self-supervised learning for pretraining image/vision transformers and then using them for downstream classification tasks. We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model. The architectural flexibility of SiT allows us to use it as an autoencoder and to work with multiple self-supervised tasks seamlessly. We show that a pretrained SiT can be finetuned for a downstream classification task on small-scale datasets consisting of a few thousand images rather than several millions. The proposed approach is evaluated on standard datasets using common protocols. The results demonstrate the strength of transformers and their suitability for self-supervised learning. We outperform existing self-supervised learning methods by a large margin. We also observe that SiT is well suited to few-shot learning, and show that it learns useful representations by simply training a linear classifier on top of the learned features from SiT. Pretraining, finetuning, and evaluation code will be available at: https://github.com/Sara-Ahmed/SiT.
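To make the recipe in the abstract concrete, here is a minimal PyTorch-style sketch of multi-task self-supervised pretraining with a transformer used as an autoencoder. The toy encoder, the particular mix of a per-patch reconstruction loss and a contrastive loss, the equal loss weights, and all names (ToyViTAutoencoder, pretrain_step) are illustrative assumptions rather than the authors' implementation; the exact SiT objectives and training details are given in the paper and the repository linked above.

```python
# Minimal sketch (not the authors' code): a ViT-style autoencoder pretrained with
# a per-patch reconstruction loss plus a contrastive loss between two views.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyViTAutoencoder(nn.Module):
    """Transformer encoder with a reconstruction head and a contrastive projection head."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch = patch
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.recon_head = nn.Linear(dim, 3 * patch * patch)  # per-patch pixel reconstruction
        self.proj_head = nn.Linear(dim, 128)                 # global embedding for the contrastive term

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        feats = self.encoder(tokens)
        recon = self.recon_head(feats)                            # (B, N, 3 * patch * patch)
        embed = F.normalize(self.proj_head(feats.mean(dim=1)), dim=-1)
        return recon, embed


def pretrain_step(model, corrupted, clean, optimizer, temperature=0.2):
    """One multi-task step: reconstruct the clean image from a corrupted view and
    align the two views' global embeddings (an InfoNCE-style term, used here as a
    stand-in for whichever contrastive objective the pretext model adopts)."""
    recon, z1 = model(corrupted)
    _, z2 = model(clean)
    # Reconstruction target: the clean image split into non-overlapping patches.
    target = F.unfold(clean, kernel_size=model.patch, stride=model.patch).transpose(1, 2)
    loss_recon = F.l1_loss(recon, target)
    # Contrastive term: matching views are positives, other samples in the batch are negatives.
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(z1.size(0), device=z1.device)
    loss_contrast = F.cross_entropy(logits, labels)
    loss = loss_recon + loss_contrast                             # equal weights, an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example usage with random tensors standing in for two augmented views of a batch.
model = ToyViTAutoencoder()
opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
view_a, view_b = torch.rand(4, 3, 224, 224), torch.rand(4, 3, 224, 224)
print(pretrain_step(model, view_a, view_b, opt))
```

After such pretraining, the linear-probing evaluation mentioned in the abstract amounts to freezing the encoder and training only a linear classifier on its pooled features, while full finetuning replaces the self-supervised heads with a classification head and updates all weights on the small labelled dataset.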
Related papers
- Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers [22.1372572833618]
We propose a novel few-shot feature distillation approach for vision transformers.
We first copy the weights from intermittent layers of existing vision transformers (teachers) into shallower architectures (students).
Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario.
arXiv Detail & Related papers (2024-04-14T18:57:38Z)
- Transformers for Supervised Online Continual Learning [11.270594318662233]
We propose a method that leverages transformers' in-context learning capabilities for online continual learning.
Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.
arXiv Detail & Related papers (2024-03-03T16:12:20Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Transformers learn in-context by gradient descent [58.24152335931036]
Training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations.
We show how trained Transformers become mesa-optimizers, i.e., they learn models by gradient descent in their forward pass.
arXiv Detail & Related papers (2022-12-15T09:21:21Z)
- SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z)
- An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers [0.0]
We explore self-supervised methods based on image reconstruction, masked image modeling and jigsaw.
Preliminary experiments on the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings.
arXiv Detail & Related papers (2022-05-11T14:39:27Z)
- DeiT III: Revenge of the ViT [56.46810490275699]
A Vision Transformer (ViT) is a simple neural architecture amenable to serving several computer vision tasks.
Recent works show that ViTs benefit from self-supervised pre-training, in particular BERT-like pre-training such as BeiT.
arXiv Detail & Related papers (2022-04-14T17:13:44Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- Self-Supervised Learning with Swin Transformers [24.956637957269926]
We present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture.
The approach introduces essentially no new inventions; it is a combination of MoCo v2 and BYOL.
Its performance is slightly better than the recent MoCo v3 and DINO, which adopt DeiT as the backbone, but it relies on much lighter tricks.
arXiv Detail & Related papers (2021-05-10T17:59:45Z)
- An Empirical Study of Training Self-Supervised Visual Transformers [70.27107708555185]
We study the effects of several fundamental components for training self-supervised Visual Transformers.
We reveal that apparently good results can mask training instability; such results are in fact partial failures and can be improved when training is made more stable.
arXiv Detail & Related papers (2021-04-05T17:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.