On the Surprising Effectiveness of Transformers in Low-Labeled Video
Recognition
- URL: http://arxiv.org/abs/2209.07474v1
- Date: Thu, 15 Sep 2022 17:12:30 GMT
- Title: On the Surprising Effectiveness of Transformers in Low-Labeled Video
Recognition
- Authors: Farrukh Rahman, Ömer Mubarek, Zsolt Kira
- Abstract summary: Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well.
- Score: 18.557920268145818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, vision transformers have been shown to be competitive with
convolution-based methods (CNNs) broadly across multiple vision tasks. The less
restrictive inductive bias of transformers endows them with greater
representational capacity than CNNs. However, in the image classification setting
this flexibility comes with a trade-off with respect to sample efficiency,
where transformers require ImageNet-scale training. This notion has carried
over to video, where transformers have not yet been explored for video
classification in the low-labeled or semi-supervised settings. Our work
empirically explores the low data regime for video classification and discovers
that, surprisingly, transformers perform extremely well in the low-labeled
video setting compared to CNNs. We specifically evaluate video vision
transformers across two contrasting video datasets (Kinetics-400 and
SomethingSomething-V2) and perform thorough analysis and ablation studies to
explain this observation using the predominant features of video transformer
architectures. We even show that using just the labeled data, transformers
significantly outperform complex semi-supervised CNN methods that leverage
large-scale unlabeled data as well. Our experiments inform our recommendation
that future work on semi-supervised video learning should consider the use of
video transformers.
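As a concrete illustration of the low-labeled setting described in the abstract, the sketch below keeps only a small, class-balanced fraction of the labels and fine-tunes a toy video transformer (tubelet embedding followed by a transformer encoder) on that subset. It is a minimal PyTorch sketch under stated assumptions: the `TinyVideoTransformer` model, the random stand-in clips, and the 10% label fraction are illustrative placeholders, not the authors' architecture, datasets, or training recipe.

```python
# Minimal sketch of low-labeled video classification. The dataset, model size,
# and label fraction are illustrative placeholders, not the paper's setup.
import torch
import torch.nn as nn
from collections import defaultdict
from torch.utils.data import DataLoader, Subset, TensorDataset


class TinyVideoTransformer(nn.Module):
    """Toy ViViT-style model: tubelet embedding + transformer encoder."""

    def __init__(self, num_classes, dim=192, tubelet=(2, 16, 16)):
        super().__init__()
        # A 3-D convolution turns a clip (C, T, H, W) into a grid of tubelet tokens.
        self.tokenize = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                         # x: (B, 3, T, H, W)
        tokens = self.tokenize(x)                 # (B, dim, T', H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])               # classify from the [CLS] token


def low_label_subset(labels, fraction):
    """Keep only `fraction` of the labeled examples, balanced per class."""
    per_class = defaultdict(list)
    for idx, y in enumerate(labels):
        per_class[int(y)].append(idx)
    keep = []
    for idxs in per_class.values():
        keep.extend(idxs[: max(1, int(len(idxs) * fraction))])
    return keep


# Stand-in data: 64 random clips of 8 frames at 32x32, 5 classes.
clips = torch.randn(64, 3, 8, 32, 32)
labels = torch.randint(0, 5, (64,))
dataset = TensorDataset(clips, labels)

train_set = Subset(dataset, low_label_subset(labels, fraction=0.1))
loader = DataLoader(train_set, batch_size=4, shuffle=True)

model = TinyVideoTransformer(num_classes=5)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for x, y in loader:                               # one illustrative pass
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the paper's actual experiments the analogous protocol would be applied to label-fraction splits of Kinetics-400 and SomethingSomething-V2 with full-scale video transformer backbones; the toy version above only shows the subsampling and fine-tuning mechanics.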
Related papers
- Deep Laparoscopic Stereo Matching with Transformers [46.18206008056612]
The self-attention mechanism, successfully employed in the transformer architecture, has shown promise in many computer vision tasks.
We propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of the CNN and the transformer in a unified design.
arXiv Detail & Related papers (2022-07-25T12:54:32Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in the semi-supervised ImageNet setting.
CNNs achieve superior results in the small labeled data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z) - Can Vision Transformers Perform Convolution? [78.42076260340869]
We constructively prove that a single ViT layer with image patches as input can perform any convolution operation.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent (a schematic sketch of this kind of objective follows this list).
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Towards Training Stronger Video Vision Transformers for
EPIC-KITCHENS-100 Action Recognition [27.760120524736678]
We present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.
A single ViViT model achieves 47.4% on the validation set of the EPIC-KITCHENS-100 dataset.
We find that video transformers are especially good at predicting the noun in the verb-noun action prediction task.
arXiv Detail & Related papers (2021-06-09T13:26:02Z) - Gaze Estimation using Transformer [14.26674946195107]
We consider two forms of vision transformer: pure transformers and hybrid transformers.
We first follow the popular ViT and employ a pure transformer to estimate gaze from images.
On the other hand, for the hybrid form we preserve the convolutional layers and integrate CNNs with transformers.
arXiv Detail & Related papers (2021-05-30T04:06:29Z) - Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with
56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than their CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z) - Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformers architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes the new state of the art on Imagenet with Reassessed labels and Imagenet-V2 / match frequency.
arXiv Detail & Related papers (2021-03-31T17:37:32Z) - A Survey on Visual Transformer [126.56860258176324]
The transformer is a type of deep neural network based mainly on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.