PatchRot: A Self-Supervised Technique for Training Vision Transformers
- URL: http://arxiv.org/abs/2210.15722v1
- Date: Thu, 27 Oct 2022 18:55:12 GMT
- Title: PatchRot: A Self-Supervised Technique for Training Vision Transformers
- Authors: Sachin Chhabra, Prabal Bijoy Dutta, Hemanth Venkateswara and Baoxin Li
- Abstract summary: Vision transformers require a huge amount of labeled data to outperform convolutional neural networks.
We propose PatchRot, a self-supervised technique crafted for vision transformers.
Our experiments on different datasets show that PatchRot training learns rich features that outperform both supervised learning and the compared baseline.
- Score: 22.571734100855046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers require a huge amount of labeled data to outperform
convolutional neural networks. However, labeling a huge dataset is a very
expensive process. Self-supervised learning techniques alleviate this problem
by learning features similar to supervised learning in an unsupervised way. In
this paper, we propose PatchRot, a self-supervised technique crafted for
vision transformers. PatchRot rotates images and image patches and trains the
network to predict the rotation angles. The network learns to extract both
global and local features from an image. Our extensive experiments on different
datasets show that PatchRot training learns rich features that outperform both
supervised learning and the compared baseline.
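A minimal sketch of the pretext task the abstract describes: rotate a whole image and its non-overlapping patches by random multiples of 90 degrees, and keep the rotation labels a ViT would be trained to predict. The patch size, label layout, and helper name are assumptions, not the authors' exact recipe.

```python
import torch

def patchrot_batch(images, patch_size=16):
    """Build a PatchRot-style pretext batch (a sketch, not the paper's exact recipe).

    Rotates each whole image by a random multiple of 90 degrees, then rotates
    every non-overlapping patch by its own random multiple, and returns the
    rotation labels (0..3, i.e., k * 90 degrees) to predict.
    """
    B, C, H, W = images.shape
    global_rot = torch.randint(0, 4, (B,))                      # per-image label
    images = torch.stack([torch.rot90(img, k=int(k), dims=(1, 2))
                          for img, k in zip(images, global_rot)])

    p = patch_size                                              # tile the image
    patches = images.unfold(2, p, p).unfold(3, p, p)            # B,C,H/p,W/p,p,p
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C, p, p)

    patch_rot = torch.randint(0, 4, patches.shape[:2])          # per-patch labels
    for b in range(B):
        for i in range(patches.shape[1]):
            patches[b, i] = torch.rot90(patches[b, i], k=int(patch_rot[b, i]),
                                        dims=(1, 2))
    return patches, global_rot, patch_rot

# The rotated patches are fed to a ViT with 4-way rotation-classification heads;
# the pretext loss is cross-entropy on the predicted angles.
patches, g, pr = patchrot_batch(torch.randn(2, 3, 224, 224))
print(patches.shape, g.shape, pr.shape)  # (2, 196, 3, 16, 16), (2,), (2, 196)
```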
Related papers
- An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels [65.64402188506644]
Vanilla Transformers can operate by treating each individual pixel as a token and achieve highly performant results.
We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision.
arXiv Detail & Related papers (2024-06-13T17:59:58Z)
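Taking the pixels-as-tokens summary above at face value, the sketch below embeds every pixel as its own token instead of grouping pixels into 16x16 patches; the class name and embedding width are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class PixelTokenizer(nn.Module):
    """Treat each pixel as one token: (B, C, H, W) -> (B, H*W, dim)."""
    def __init__(self, in_channels=3, dim=64):
        super().__init__()
        self.proj = nn.Linear(in_channels, dim)   # per-pixel linear embedding

    def forward(self, x):
        tokens = x.flatten(2).transpose(1, 2)     # (B, H*W, C): one token per pixel
        return self.proj(tokens)                  # (B, H*W, dim)

# Even a 32x32 image yields 1024 tokens, so sequence length (and attention
# cost) grows quickly; that is what makes the paper's finding notable.
x = torch.randn(1, 3, 32, 32)
print(PixelTokenizer()(x).shape)  # torch.Size([1, 1024, 64])
```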
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image change detection (CD).
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
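The TransY-Net summary above says multi-level features are combined "in a pyramid manner" without further detail; a generic top-down pyramid fusion (an FPN-style assumption, not the paper's exact design) looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Generic top-down fusion of multi-level features (FPN-style sketch)."""
    def __init__(self, channels=(64, 128, 256), out_dim=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in channels])

    def forward(self, feats):                 # feats: fine-to-coarse feature maps
        feats = [lat(f) for lat, f in zip(self.lateral, feats)]
        out = feats[-1]                       # start from the coarsest level
        for f in reversed(feats[:-1]):        # upsample, add the next finer level
            out = f + F.interpolate(out, size=f.shape[-2:], mode="bilinear",
                                    align_corners=False)
        return out

feats = [torch.randn(1, 64, 64, 64), torch.randn(1, 128, 32, 32),
         torch.randn(1, 256, 16, 16)]
print(PyramidFusion()(feats).shape)  # torch.Size([1, 64, 64, 64])
```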
- Gated Self-supervised Learning For Improving Supervised Learning [1.784933900656067]
We propose a novel approach to self-supervised learning for image classification that uses several localizable augmentations combined with a gating method.
Our approach uses flip and channel-shuffle augmentations in addition to rotation, allowing the model to learn rich features from the data.
arXiv Detail & Related papers (2023-01-14T09:32:12Z)
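The gated self-supervision summary above names rotation, flip, and channel-shuffle augmentations but not how the labels are encoded; the sketch below shows one plausible way to generate the pretext views and targets.

```python
import torch

def augment_with_labels(img):
    """Create a labeled self-supervised view via rotation, flip, and channel
    shuffle (one plausible encoding; the paper's exact scheme may differ)."""
    rot_k = int(torch.randint(0, 4, ()))        # rotation class: 0..3
    flip = int(torch.randint(0, 2, ()))         # flip class: 0 or 1
    perm = torch.randperm(img.shape[0])         # channel-shuffle permutation

    view = torch.rot90(img, k=rot_k, dims=(1, 2))
    if flip:
        view = torch.flip(view, dims=(2,))      # horizontal flip
    view = view[perm]                           # shuffle the color channels
    return view, {"rotation": rot_k, "flip": flip, "channel_perm": perm.tolist()}

view, labels = augment_with_labels(torch.randn(3, 32, 32))
print(view.shape, labels)
```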
- An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers [0.0]
We explore self-supervised methods based on image reconstruction, masked image modeling and jigsaw.
Preliminary experiments on the iSAID dataset demonstrate faster convergence of DETR in the initial epochs in both pretraining and multi-task learning settings.
arXiv Detail & Related papers (2022-05-11T14:39:27Z)
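Of the pretext tasks the empirical study above explores, jigsaw is the simplest to sketch: shuffle a grid of image tiles and make the permutation the prediction target. This is the generic jigsaw formulation, not necessarily the paper's variant.

```python
import torch

def jigsaw_pretext(img, grid=3):
    """Jigsaw pretext sketch: cut the image into a grid of tiles, shuffle them,
    and return the permutation the network must recover."""
    C, H, W = img.shape
    th, tw = H // grid, W // grid
    tiles = img[:, :th * grid, :tw * grid]                 # crop to a full grid
    tiles = tiles.unfold(1, th, th).unfold(2, tw, tw)      # C, g, g, th, tw
    tiles = tiles.permute(1, 2, 0, 3, 4).reshape(grid * grid, C, th, tw)
    perm = torch.randperm(grid * grid)                     # prediction target
    return tiles[perm], perm

tiles, perm = jigsaw_pretext(torch.randn(3, 96, 96))
print(tiles.shape, perm)  # torch.Size([9, 3, 32, 32]) and a permutation of 0..8
```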
- Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by large margins of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z)
- ViT-P: Rethinking Data-efficient Vision Transformers from Locality [9.515925867530262]
We make vision transformers as data-efficient as convolutional neural networks by introducing multi-focal attention bias.
Inspired by the attention distance in a well-trained ViT, we constrain the self-attention of ViT to have a multi-scale localized receptive field.
On CIFAR-100, our ViT-P Base model achieves state-of-the-art accuracy (83.16%) when trained from scratch.
arXiv Detail & Related papers (2022-03-04T14:49:48Z)
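ViT-P's multi-focal attention bias is summarized above only as a localized receptive field; the sketch below builds a plain distance-based attention mask over patch tokens as an illustrative stand-in (different heads could use different radii to obtain a multi-scale field).

```python
import torch

def local_attention_mask(grid=14, radius=2):
    """Boolean (N, N) mask over patch tokens: True where attention is allowed.
    Each patch attends only to patches within `radius` in Chebyshev distance;
    an illustrative localized mask, not ViT-P's exact multi-focal design."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1)     # (N, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().max(dim=-1).values
    return dist <= radius

# Apply inside attention as scores.masked_fill(~mask, float("-inf")) pre-softmax.
mask = local_attention_mask()
print(mask.shape, mask.float().mean())  # torch.Size([196, 196]) and its density
```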
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting, whereas CNNs achieve superior results in the small-labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolutional networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)
- SiT: Self-supervised vIsion Transformer [23.265568744478333]
In natural language processing (NLP) self-supervised learning and transformers are already the methods of choice.
We propose Self-supervised vIsion Transformers (SiT) and discuss several self-supervised training mechanisms to obtain a pretext model.
We show that a pretrained SiT can be fine-tuned for a downstream classification task on small-scale datasets.
arXiv Detail & Related papers (2021-04-08T08:34:04Z)
- Auto-Rectify Network for Unsupervised Indoor Depth Estimation [119.82412041164372]
We establish that the complex ego-motions exhibited in handheld settings are a critical obstacle for learning depth.
We propose a data pre-processing method that rectifies training images by removing their relative rotations for effective learning.
Our results outperform the previous unsupervised SOTA method by a large margin on the challenging NYUv2 dataset.
arXiv Detail & Related papers (2020-06-04T08:59:17Z)
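The Auto-Rectify summary above describes removing relative rotations from training images; assuming the in-plane rotation angle has already been estimated (the summary does not say how), the rectification step itself reduces to an inverse rotation:

```python
import torch
import torchvision.transforms.functional as TF

def rectify(img, estimated_roll_deg):
    """Undo an estimated in-plane camera rotation so training views share an
    upright orientation. How the angle is obtained (IMU, geometry, ...) is an
    assumption; the summary only says relative rotations are removed."""
    return TF.rotate(img, angle=-estimated_roll_deg)

upright = rectify(torch.randn(3, 240, 320), estimated_roll_deg=7.5)
print(upright.shape)  # torch.Size([3, 240, 320])
```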
- Supervised and Unsupervised Learning of Parameterized Color Enhancement [112.88623543850224]
We tackle the problem of color enhancement as an image translation task using both supervised and unsupervised learning.
We achieve state-of-the-art results compared to both supervised (paired data) and unsupervised (unpaired data) image enhancement methods on the MIT-Adobe FiveK benchmark.
We show the generalization capability of our method by applying it to photos from the early 20th century and to dark video frames.
arXiv Detail & Related papers (2019-12-30T13:57:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.