Spatiotemporal Contrastive Video Representation Learning
- URL: http://arxiv.org/abs/2008.03800v4
- Date: Mon, 5 Apr 2021 19:51:20 GMT
- Title: Spatiotemporal Contrastive Video Representation Learning
- Authors: Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang,
Serge Belongie, Yin Cui
- Abstract summary: We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
- Score: 87.56145031149869
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a self-supervised Contrastive Video Representation Learning (CVRL)
method to learn spatiotemporal visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented
clips from the same short video are pulled together in the embedding space,
while clips from different videos are pushed away. We study what makes for good
data augmentations for video self-supervised learning and find that both
spatial and temporal information are crucial. We carefully design data
augmentations involving spatial and temporal cues. Concretely, we propose a
temporally consistent spatial augmentation method to impose strong spatial
augmentations on each frame of the video while maintaining the temporal
consistency across frames. We also propose a sampling-based temporal
augmentation method to avoid overly enforcing invariance on clips that are
distant in time. On Kinetics-600, a linear classifier trained on the
representations learned by CVRL achieves 70.4% top-1 accuracy with a
3D-ResNet-50 (R3D-50) backbone, outperforming ImageNet supervised pre-training
by 15.7% and SimCLR unsupervised pre-training by 18.8% using the same inflated
R3D-50. The performance of CVRL can be further improved to 72.9% with a larger
R3D-152 (2x filters) backbone, significantly closing the gap between
unsupervised and supervised video representation learning. Our code and models
will be available at
https://github.com/tensorflow/models/tree/master/official/.
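The contrastive objective described above follows the standard SimCLR-style InfoNCE formulation: the embeddings of two augmented clips from the same video form a positive pair, while clips from the other videos in the batch serve as negatives. Below is a minimal PyTorch sketch of such a loss; the temperature value and the function name are illustrative assumptions, not the paper's official TensorFlow implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_clip_loss(z_a, z_b, temperature=0.1):
    """SimCLR-style InfoNCE loss over a batch of clip embeddings.

    z_a, z_b: (N, D) embeddings of two augmented clips per video. Row i of
    z_a and row i of z_b come from the same video (a positive pair); all
    other clips in the batch act as negatives.
    """
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)

    z = torch.cat([z_a, z_b], dim=0)            # (2N, D)
    sim = torch.mm(z, z.t()) / temperature      # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))           # a clip is never its own negative

    n = z_a.shape[0]
    # For clip i, the positive is the other augmented clip of the same video.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

Pulling the positive pair together while pushing clips from other videos away is exactly the behavior the abstract describes; any encoder producing fixed-size clip embeddings (e.g., an inflated R3D-50) can be placed in front of this loss.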
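The temporally consistent spatial augmentation can be sketched as follows: the random augmentation parameters are sampled once per clip and applied identically to every frame, so each frame is strongly perturbed while motion cues across frames remain intact. The NumPy sketch below covers only random cropping and flipping and assumes a (T, H, W, C) clip layout; the crop size and function name are illustrative.

```python
import numpy as np

def temporally_consistent_augment(clip, crop_size=224, rng=None):
    """Apply the SAME randomly sampled crop and flip to every frame of a clip.

    clip: array of shape (T, H, W, C). Sampling the parameters once per clip
    (rather than once per frame) keeps the augmentation temporally consistent.
    """
    if rng is None:
        rng = np.random.default_rng()
    t, h, w, c = clip.shape

    # Draw the spatial augmentation parameters a single time for the whole clip.
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    flip = rng.random() < 0.5

    # Apply the identical parameters to all T frames.
    out = clip[:, top:top + crop_size, left:left + crop_size, :]
    if flip:
        out = out[:, :, ::-1, :]
    return out
```

Color jittering or blurring would be handled the same way: draw the jitter strength once and reuse it for every frame of the clip.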
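The sampling-based temporal augmentation can likewise be illustrated: rather than pairing arbitrary clips, the temporal gap between the two positive clips is drawn from a distribution that decays with the gap, so nearby clips are paired far more often than distant ones and invariance is not overly enforced on clips that are far apart in time. The exponential decay and parameter names below are illustrative assumptions, not the paper's exact sampling distribution.

```python
import numpy as np

def sample_positive_clip_starts(num_frames, clip_len=16, decay=0.05, rng=None):
    """Sample start frames of two positive clips from the same video.

    The gap between the two clips is drawn from a monotonically decreasing
    distribution over possible gaps, so temporally close clips are preferred.
    """
    if rng is None:
        rng = np.random.default_rng()

    max_gap = num_frames - clip_len
    gaps = np.arange(max_gap + 1)
    weights = np.exp(-decay * gaps)          # decreasing weight (illustrative choice)
    gap = rng.choice(gaps, p=weights / weights.sum())

    start_a = rng.integers(0, num_frames - clip_len - gap + 1)
    start_b = start_a + gap
    return int(start_a), int(start_b)
```

The two sampled clips would then be augmented with the temporally consistent spatial augmentation sketched above and encoded as a positive pair for the contrastive loss.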
Related papers
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning [29.620990627792906]
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning.
arXiv Detail & Related papers (2024-05-24T02:29:03Z)
- In Defense of Image Pre-Training for Spatiotemporal Recognition [32.56468478601864]
The key to effectively leveraging image pre-training lies in decomposing the learning of spatial and temporal features.
The new pipeline consistently achieves better results on video recognition with a significant speedup.
arXiv Detail & Related papers (2022-05-03T18:45:44Z)
- Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net [15.032516344808526]
We introduce the 3DST-UNet-RL framework for video summarization.
We show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks.
The proposed video summarization has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis.
arXiv Detail & Related papers (2021-06-19T16:27:19Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping also work well for videos, but that previous implementations could not apply them at a scale sufficient for them to be effective.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z)
- TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-supervised Temporal Discriminative Learning for Video Representation Learning [39.43942923911425]
Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework in a self-supervised manner.
arXiv Detail & Related papers (2020-08-05T13:36:59Z)