Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer
- URL: http://arxiv.org/abs/2206.00481v1
- Date: Wed, 1 Jun 2022 13:25:32 GMT
- Title: Where are my Neighbors? Exploiting Patches Relations in Self-Supervised
Vision Transformer
- Authors: Guglielmo Camporese, Elena Izzo, Lamberto Ballan
- Abstract summary: We propose a simple but still effective self-supervised learning (SSL) strategy to train Vision Transformers (ViTs)
We define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly during the downstream training.
Our RelViT model optimize all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step.
- Score: 3.158346511479111
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision Transformers (ViTs) enabled the use of transformer architecture on
vision tasks showing impressive performances when trained on big datasets.
However, on relatively small datasets, ViTs are less accurate given their lack
of inductive bias. To this end, we propose a simple but still effective
self-supervised learning (SSL) strategy to train ViTs, that without any
external annotation, can significantly improve the results. Specifically, we
define a set of SSL tasks based on relations of image patches that the model
has to solve before or jointly during the downstream training. Differently from
ViT, our RelViT model optimizes all the output tokens of the transformer
encoder that are related to the image patches, thus exploiting more training
signal at each training step. We investigated our proposed methods on several
image benchmarks finding that RelViT improves the SSL state-of-the-art methods
by a large margin, especially on small datasets.
Related papers
- Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers [5.359378066251386]
Self-supervised learning with vision transformers (ViTs) has proven effective for representation learning.
Existing ViT-based SSL architectures do not fully exploit the ViT backbone.
We introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively.
arXiv Detail & Related papers (2024-06-18T06:36:44Z) - Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis [38.074487843137064]
This paper investigates the effectiveness of self-supervised pre-trained vision transformers (ViTs) compared to supervised pre-trained ViTs and conventional neural networks (ConvNets) for detecting facial deepfake images and videos.
It examines their potential for improved generalization and explainability, especially with limited training data.
By leveraging SSL ViTs for deepfake detection with modest data and partial fine-tuning, we find comparable adaptability to deepfake detection and explainability via the attention mechanism.
arXiv Detail & Related papers (2024-05-01T07:16:49Z) - Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA)
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z) - How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This allows to train these models without large-scale pre-training, changes to model architecture or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs)
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN using ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first one that achieves higher performance than those CNN state-of-the-arts.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have multi-scale localized receptive field.
On Cifar100, our ViT-P Base model achieves the state-of-the-art accuracy (83.16%) trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - How to train your ViT? Data, Augmentation, and Regularization in Vision
Transformers [74.06040005144382]
Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications.
We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget.
We train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
arXiv Detail & Related papers (2021-06-18T17:58:20Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.