Long-Short Temporal Contrastive Learning of Video Transformers
- URL: http://arxiv.org/abs/2106.09212v1
- Date: Thu, 17 Jun 2021 02:30:26 GMT
- Title: Long-Short Temporal Contrastive Learning of Video Transformers
- Authors: Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
- Abstract summary: Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
- Score: 62.71874976426988
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video transformers have recently emerged as a competitive alternative to 3D
CNNs for video understanding. However, due to their large number of parameters
and reduced inductive biases, these models require supervised pretraining on
large-scale image datasets to achieve top performance. In this paper, we
empirically demonstrate that self-supervised pretraining of video transformers
on video-only datasets can lead to action recognition results that are on par
or better than those obtained with supervised pretraining on large-scale image
datasets, even massive ones such as ImageNet-21K. Since transformer-based
models are effective at capturing dependencies over extended temporal spans, we
propose a simple learning procedure that forces the model to match a long-term
view to a short-term view of the same video. Our approach, named Long-Short
Temporal Contrastive Learning (LSTCL), enables video transformers to learn an
effective clip-level representation by predicting temporal context captured
from a longer temporal extent. To demonstrate the generality of our findings,
we implement and validate our approach under three different self-supervised
contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct
video-transformer architectures, including an improved variant of the Swin
Transformer augmented with space-time attention. We conduct a thorough ablation
study and show that LSTCL achieves competitive performance on multiple video
benchmarks and represents a convincing alternative to supervised image-based
pretraining.
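
Below is a minimal, illustrative sketch of the long/short-view contrastive objective described in the abstract. All names and hyperparameters here (`sample_clip`, `neg_cosine`, `lstcl_step`, clip lengths, strides, the `encoder` and `predictor` modules) are assumptions for exposition, not the authors' released code; the loss follows a generic SimSiam/BYOL-style negative-cosine objective with a stop-gradient, one of the frameworks the paper plugs LSTCL into.

```python
# Sketch of long-short temporal contrastive training (assumed names, not the paper's code).
import torch
import torch.nn.functional as F

def sample_clip(video, num_frames, stride):
    """Randomly sample `num_frames` frames with temporal `stride` from a (C, T, H, W) video."""
    _, t, _, _ = video.shape
    span = num_frames * stride
    start = torch.randint(0, max(t - span, 1), (1,)).item()
    idx = torch.arange(start, start + span, stride).clamp(max=t - 1)
    return video[:, idx]

def neg_cosine(p, z):
    """Negative cosine similarity with a stop-gradient on the target branch."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)
    return -(p * z).sum(dim=-1).mean()

def lstcl_step(videos, encoder, predictor, short_len=8, long_len=32, stride=4):
    """One training step: match the representation of a short clip to the temporal
    context captured by a longer clip of the same video, and vice versa."""
    shorts = torch.stack([sample_clip(v, short_len, stride) for v in videos])
    longs = torch.stack([sample_clip(v, long_len, stride) for v in videos])
    z_s, z_l = encoder(shorts), encoder(longs)  # clip-level embeddings from a video transformer
    # Symmetrized objective, as is common in SimSiam/BYOL-style training.
    return 0.5 * (neg_cosine(predictor(z_s), z_l) + neg_cosine(predictor(z_l), z_s))
```

In practice `encoder` would be a video transformer backbone with a projection head and `predictor` a small MLP; a MoCo v3 variant would replace the negative-cosine term with an InfoNCE loss over a batch of long/short pairs.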
Related papers
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [94.10861578387443]
We explore the inference process of two mainstream text-to-video (T2V) models, one based on transformers and one based on diffusion.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning.
arXiv Detail & Related papers (2023-12-06T12:34:47Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- SVFormer: Semi-supervised Video Transformer for Action Recognition [88.52042032347173]
We introduce SVFormer, which adopts a steady pseudo-labeling framework to cope with unlabeled video samples.
In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos.
In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400.
arXiv Detail & Related papers (2022-11-23T18:58:42Z)
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- Shifted Chunk Transformer for Spatio-Temporal Representational Learning [24.361059477031162]
We construct a shifted chunk Transformer with pure self-attention blocks.
This Transformer can learn hierarchical spatio-temporal features from a tiny patch to a global video clip.
It outperforms state-of-the-art approaches on Kinetics, Kinetics-600, UCF101, and HMDB51.
arXiv Detail & Related papers (2021-08-26T04:34:33Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers (see the tokenization sketch after this list).
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
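
As a rough illustration of the spatio-temporal tokenization mentioned in the ViViT entry above, a tubelet-style tokenizer can be sketched with a 3D convolution; the tubelet size and embedding width below are assumed values, not the paper's configuration.

```python
# Illustrative tubelet-style tokenizer (assumed sizes, not ViViT's exact setup).
import torch
from torch import nn

class TubeletEmbed(nn.Module):
    """Map a video (B, C, T, H, W) to a sequence of spatio-temporal tokens."""
    def __init__(self, in_chans=3, embed_dim=768, tubelet=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(in_chans, embed_dim, kernel_size=tubelet, stride=tubelet)

    def forward(self, video):
        x = self.proj(video)                 # (B, D, T', H', W')
        return x.flatten(2).transpose(1, 2)  # (B, N, D) token sequence for the transformer

tokens = TubeletEmbed()(torch.randn(1, 3, 8, 224, 224))  # -> (1, 784, 768)
```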