Towards Training Stronger Video Vision Transformers for
EPIC-KITCHENS-100 Action Recognition
- URL: http://arxiv.org/abs/2106.05058v1
- Date: Wed, 9 Jun 2021 13:26:02 GMT
- Title: Towards Training Stronger Video Vision Transformers for
EPIC-KITCHENS-100 Action Recognition
- Authors: Ziyuan Huang, Zhiwu Qing, Xiang Wang, Yutong Feng, Shiwei Zhang,
Jianwen Jiang, Zhurong Xia, Mingqian Tang, Nong Sang, Marcelo H. Ang Jr
- Abstract summary: We present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.
A single ViViT model achieves 47.4% accuracy on the validation set of the EPIC-KITCHENS-100 dataset.
We find that video transformers are especially good at predicting the noun in the verb-noun action prediction task.
- Score: 27.760120524736678
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent surge in the research of vision transformers, they have
demonstrated remarkable potential for various challenging computer vision
applications, such as image recognition, point cloud classification, and video
understanding. In this paper, we present empirical results for training a
stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition
dataset. Specifically, we explore training techniques for video vision
transformers, such as augmentations, input resolutions, and initialization.
With our training recipe, a single ViViT model achieves 47.4% accuracy on the
validation set of the EPIC-KITCHENS-100 dataset, outperforming the result
reported in the original paper by 3.4%. We find that video transformers are
especially good at predicting the noun in the verb-noun action prediction task.
This makes the overall action prediction accuracy of video transformers notably
higher than that of convolutional networks. Surprisingly, even the best video
transformers underperform convolutional networks on verb prediction. Therefore, we
combine the video vision transformers and some of the convolutional video
networks and present our solution to the EPIC-KITCHENS-100 Action Recognition
competition.
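
The abstract states that the final submission combines video vision transformers with convolutional video networks, since the former are stronger on nouns while the latter remain competitive on verbs. As an illustration only, the sketch below shows one common way such a combination can be done, namely weighted late fusion of per-head softmax scores; the fusion weights, model outputs, and function names are hypothetical assumptions, not the authors' actual ensembling scheme.

```python
# Minimal sketch (not the authors' code): weighted late fusion of verb/noun
# scores from a video transformer and a convolutional video network.
# Fusion weights and tensor contents are illustrative assumptions.
import torch


def fuse_scores(vivit_out, cnn_out, w_verb=(0.5, 0.5), w_noun=(0.7, 0.3)):
    """Combine per-head softmax scores from two models.

    vivit_out / cnn_out: dicts with 'verb' [B, 97] and 'noun' [B, 300] logits
    (EPIC-KITCHENS-100 has 97 verb and 300 noun classes).
    The noun head leans on the transformer while the verb head weights both
    models more evenly, mirroring the observation that transformers are
    stronger on nouns and CNNs remain competitive on verbs.
    """
    verb = w_verb[0] * vivit_out["verb"].softmax(-1) + w_verb[1] * cnn_out["verb"].softmax(-1)
    noun = w_noun[0] * vivit_out["noun"].softmax(-1) + w_noun[1] * cnn_out["noun"].softmax(-1)
    return {"verb": verb, "noun": noun}


if __name__ == "__main__":
    B = 4  # a batch of 4 clips with random logits standing in for model outputs
    vivit = {"verb": torch.randn(B, 97), "noun": torch.randn(B, 300)}
    cnn = {"verb": torch.randn(B, 97), "noun": torch.randn(B, 300)}
    fused = fuse_scores(vivit, cnn)
    print(fused["verb"].argmax(-1), fused["noun"].argmax(-1))
```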
Related papers
- On Convolutional Vision Transformers for Yield Prediction [0.0]
The Convolutional vision Transformer (CvT) is tested to evaluate vision
transformers, which currently achieve state-of-the-art results in many other vision tasks.
It performs worse than widely tested methods such as XGBoost and CNNs, but shows
that transformers have the potential to improve yield prediction.
arXiv Detail & Related papers (2024-02-08T10:50:12Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition [18.557920268145818]
Video vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks.
Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting.
We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well.
arXiv Detail & Related papers (2022-09-15T17:12:30Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Token Labeling: Training a 85.4% Top-1 Accuracy Vision Transformer with 56M Parameters on ImageNet [86.95679590801494]
We explore the potential of vision transformers in ImageNet classification by developing a bag of training techniques.
We show that by slightly tuning the structure of vision transformers and introducing token labeling, our models are able to achieve better results than the CNN counterparts.
arXiv Detail & Related papers (2021-04-22T04:43:06Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
New architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements
and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
Transformer is a type of deep neural network mainly based on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)