ViViT: A Video Vision Transformer
- URL: http://arxiv.org/abs/2103.15691v1
- Date: Mon, 29 Mar 2021 15:27:17 GMT
- Title: ViViT: A Video Vision Transformer
- Authors: Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
- Abstract summary: We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
- Score: 75.74690759089529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present pure-transformer based models for video classification, drawing
upon the recent success of such models in image classification. Our model
extracts spatio-temporal tokens from the input video, which are then encoded by
a series of transformer layers. In order to handle the long sequences of tokens
encountered in video, we propose several efficient variants of our model which
factorise the spatial and temporal dimensions of the input. Although
transformer-based models are known to only be effective when large training
datasets are available, we show how we can effectively regularise the model
during training and leverage pretrained image models to be able to train on
comparatively small datasets. We conduct thorough ablation studies, and achieve
state-of-the-art results on multiple video classification benchmarks including
Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in
Time, outperforming prior methods based on deep 3D convolutional networks. To
facilitate further research, we will release code and models.
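As a rough illustration of the factorised design described in the abstract (spatio-temporal tokens, then separate handling of the spatial and temporal dimensions), the following PyTorch sketch shows a tubelet-style tokeniser feeding a spatial transformer stage and then a temporal one. This is a minimal sketch under stated assumptions, not the paper's released architecture: the class name, tubelet size, layer counts, and mean-pooling are illustrative choices, and positional embeddings and the classification token are omitted.

```python
# Minimal sketch (assumption-laden, not the released ViViT code): a tubelet-style
# tokeniser followed by a factorised encoder that applies spatial attention within
# each time step and temporal attention across time steps. Positional embeddings
# and the classification token are omitted for brevity.
import torch
import torch.nn as nn

class FactorisedVideoTransformer(nn.Module):
    def __init__(self, num_classes=400, dim=768, heads=12,
                 tubelet=(2, 16, 16), spatial_layers=8, temporal_layers=4):
        super().__init__()
        # Extract spatio-temporal tokens: each non-overlapping (t, h, w) tubelet
        # becomes one token of dimension `dim`.
        self.to_tokens = nn.Conv3d(3, dim, kernel_size=tubelet, stride=tubelet)
        def encoder(num_layers):
            layer = nn.TransformerEncoderLayer(dim, heads,
                                               dim_feedforward=4 * dim,
                                               batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)
        self.spatial = encoder(spatial_layers)    # attends within one time index
        self.temporal = encoder(temporal_layers)  # attends across time indices
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                     # video: (B, 3, T, H, W)
        x = self.to_tokens(video)                 # (B, dim, nt, nh, nw)
        b, d, nt, nh, nw = x.shape
        x = x.permute(0, 2, 3, 4, 1).reshape(b * nt, nh * nw, d)
        x = self.spatial(x)                       # spatial attention per time step
        x = x.mean(dim=1).reshape(b, nt, d)       # pool spatial tokens
        x = self.temporal(x)                      # temporal attention over time
        return self.head(x.mean(dim=1))           # clip-level logits

# Toy usage with a small configuration (16-frame 224x224 clip).
model = FactorisedVideoTransformer(dim=192, heads=3,
                                   spatial_layers=2, temporal_layers=2)
logits = model(torch.randn(1, 3, 16, 224, 224))  # -> shape (1, 400)
```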
Related papers
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which encodes multiple views of the input video at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent (a minimal illustration of this objective appears after this list).
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z)
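As a rough illustration of the long/short contrastive objective described in the Long-Short Temporal Contrastive Learning entry above, the sketch below contrasts clip-level embeddings of a short clip against embeddings computed from a longer temporal extent of the same video using a standard InfoNCE loss. The function name, temperature, and symmetric formulation are assumptions made for illustration and do not come from the paper.

```python
# Minimal sketch of a long/short clip contrastive objective in the spirit of the
# Long-Short Temporal Contrastive Learning entry above; the function name,
# temperature, and symmetric InfoNCE formulation are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn.functional as F

def long_short_nce(short_emb, long_emb, temperature=0.07):
    """InfoNCE loss pairing each short-clip embedding with the embedding of a
    longer temporal extent of the same video (diagonal entries are positives,
    all other entries in the batch are negatives)."""
    s = F.normalize(short_emb, dim=-1)            # (B, D)
    g = F.normalize(long_emb, dim=-1)             # (B, D)
    logits = s @ g.t() / temperature              # (B, B) cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetrise over both retrieval directions (short->long and long->short).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: embeddings from a video transformer would replace these tensors.
loss = long_short_nce(torch.randn(8, 256), torch.randn(8, 256))
```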