BEVT: BERT Pretraining of Video Transformers
- URL: http://arxiv.org/abs/2112.01529v1
- Date: Thu, 2 Dec 2021 18:59:59 GMT
- Title: BEVT: BERT Pretraining of Video Transformers
- Authors: Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan
- Abstract summary: We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
- Score: 89.08460834954161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the BERT pretraining of video transformers. It is a
straightforward but worth-studying extension given the recent success of BERT
pretraining of image transformers. We introduce BEVT which decouples video
representation learning into spatial representation learning and temporal
dynamics learning. In particular, BEVT first performs masked image modeling on
image data, and then conducts masked image modeling jointly with masked video
modeling on video data. This design is motivated by two observations: 1)
transformers learned on image datasets provide decent spatial priors that can
ease the learning of video transformers, which are often computationally
intensive if trained from scratch; 2) discriminative clues,
i.e., spatial and temporal information, needed to make correct predictions vary
among different videos due to large intra-class and inter-class variations. We
conduct extensive experiments on three challenging video benchmarks where BEVT
achieves very promising results. On Kinetics 400, for which recognition mostly
relies on discriminative spatial representations, BEVT achieves comparable
results to strong supervised baselines. On Something-Something-V2 and Diving
48, which contain videos relying on temporal dynamics, BEVT outperforms all
alternative baselines by clear margins and achieves state-of-the-art
performance with 70.6% and 86.7% Top-1 accuracy, respectively.
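To make the two-stream design concrete, the sketch below shows one way the joint objective could be written: a weight-shared encoder is applied to masked image patches and masked video tubelets, and separate heads classify the masked positions against discrete visual tokens (BEiT-style targets from a pretrained tokenizer, stubbed out here as random ids). All module names, sizes, the uniform random masking, and the loss weighting are illustrative assumptions for a minimal sketch, not BEVT's released implementation.

```python
# Minimal sketch of BEVT-style joint masked image + masked video modeling.
# Names (SharedEncoder, image_head, video_head) and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 8192          # visual-token vocabulary size (assumption)
DIM = 768             # encoder width (assumption)

class SharedEncoder(nn.Module):
    """Weight-shared transformer applied to image patches or video tubelets."""
    def __init__(self, dim=DIM, depth=4, heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):          # tokens: (B, N, DIM)
        return self.blocks(tokens)

encoder = SharedEncoder()
image_head = nn.Linear(DIM, VOCAB)      # predicts masked image tokens
video_head = nn.Linear(DIM, VOCAB)      # predicts masked video tokens
mask_embed = nn.Parameter(torch.zeros(DIM))

def masked_modeling_loss(patch_embeds, target_ids, head, mask_ratio=0.5):
    """Replace a random subset of patch embeddings with a learned [MASK]
    embedding, encode, and classify the masked positions against the
    tokenizer's discrete targets with cross-entropy."""
    B, N, _ = patch_embeds.shape
    mask = torch.rand(B, N) < mask_ratio                          # True = masked
    x = torch.where(mask.unsqueeze(-1), mask_embed.expand(B, N, DIM), patch_embeds)
    logits = head(encoder(x))                                     # (B, N, VOCAB)
    return F.cross_entropy(logits[mask], target_ids[mask])

# Stage 1 trains the image branch alone on image data; the joint loss shown
# here corresponds to the second, video-stage of pretraining. The weight
# `lam` balancing the two streams is an assumption, not the paper's schedule.
img_embeds = torch.randn(2, 196, DIM)        # e.g. 14x14 image patches
vid_embeds = torch.randn(2, 4 * 196, DIM)    # e.g. 4 tubelet groups of 14x14
img_targets = torch.randint(0, VOCAB, (2, 196))
vid_targets = torch.randint(0, VOCAB, (2, 4 * 196))

lam = 1.0
loss = masked_modeling_loss(img_embeds, img_targets, image_head) \
     + lam * masked_modeling_loss(vid_embeds, vid_targets, video_head)
loss.backward()
```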
Related papers
- Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on the VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets (a minimal tubelet-embedding sketch follows this list).
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
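The spatio-temporal tokenization mentioned in the ViViT entry above amounts to cutting the clip into non-overlapping tubelets and linearly projecting each one to a token. The sketch below is a generic 3D-convolution patch embedding under assumed tubelet and embedding sizes, not the authors' implementation.

```python
# Minimal sketch of tubelet embedding: a video of shape (B, C, T, H, W) is
# split into non-overlapping spatio-temporal tubelets, each linearly projected
# to a token. The 2x16x16 tubelet and 768-d token sizes are assumptions.
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, in_channels=3, dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        # A Conv3d with kernel == stride == tubelet size performs the
        # "cut into tubelets and linearly project" operation in one step.
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):                      # (B, C, T, H, W)
        x = self.proj(video)                       # (B, dim, T', H', W')
        return x.flatten(2).transpose(1, 2)        # (B, T'*H'*W', dim) tokens

tokens = TubeletEmbedding()(torch.randn(1, 3, 16, 224, 224))
print(tokens.shape)                                # torch.Size([1, 1568, 768])
```

The resulting token sequence is what the stack of transformer layers then encodes; positional embeddings and any factorised attention variants are omitted here.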