RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning
- URL: http://arxiv.org/abs/2011.07949v2
- Date: Mon, 15 Mar 2021 10:52:53 GMT
- Title: RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning
- Authors: Peihao Chen and Deng Huang and Dongliang He and Xiang Long and Runhao
Zeng and Shilei Wen and Mingkui Tan and Chuang Gan
- Abstract summary: We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
- Score: 100.76672109782815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study unsupervised video representation learning that seeks to learn both
motion and appearance features from unlabeled video only, which can be reused
for downstream tasks such as action recognition. This task, however, is
extremely challenging due to 1) the highly complex spatial-temporal information
in videos; and 2) the lack of labeled data for training. Unlike the
representation learning for static images, it is difficult to construct a
suitable self-supervised task that models both motion and appearance features
well. More recently, several attempts have been made to learn video
representation through video playback speed prediction. However, it is
non-trivial to obtain precise speed labels for the videos. More critically, the
learnt models may tend to focus on motion patterns and thus may not learn
appearance features well. In this paper, we observe that the relative playback
speed is more consistent with motion patterns, and thus provides more effective
and stable supervision for representation learning. Therefore, we propose a new
way to perceive the playback speed and exploit the relative speed between two
video clips as labels. In this way, we are able to perceive speed well and
learn better motion features. Moreover, to ensure the learning of appearance
features, we further propose an appearance-focused task, where we enforce the
model to perceive the appearance difference between two video clips. We show
that optimizing the two tasks jointly consistently improves the performance on
two downstream tasks, namely action recognition and video retrieval.
Remarkably, for action recognition on the UCF101 dataset, we achieve 93.7%
accuracy without using labeled data for pre-training, which outperforms the
ImageNet supervised pre-trained model. Code and pre-trained models can be found
at https://github.com/PeihaoChen/RSPNet.
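To make the relative-speed pretext task concrete, here is a minimal sketch of how the supervision signal could be constructed. It assumes a video given as a frame tensor; the helper names (`sample_clip`, `relative_speed_pair`) and the speed set are illustrative assumptions, not the authors' implementation, which is available at the repository linked above.

```python
import random

import torch


def sample_clip(video: torch.Tensor, num_frames: int, speed: int) -> torch.Tensor:
    """Sample `num_frames` frames from `video` (shape [T, C, H, W]) at the given
    playback speed by keeping every `speed`-th frame from a random start point."""
    span = num_frames * speed
    start = random.randint(0, max(video.shape[0] - span, 0))
    return video[start:start + span:speed]


def relative_speed_pair(video: torch.Tensor, num_frames: int = 16, speeds=(1, 2, 4)):
    """Sample two clips from the same video at random playback speeds and return
    them with a relative-speed label: 0 = clip1 slower, 1 = same speed,
    2 = clip1 faster. The label compares the two speeds rather than asking the
    model to predict an absolute speed value."""
    s1, s2 = random.choice(speeds), random.choice(speeds)
    clip1 = sample_clip(video, num_frames, s1)
    clip2 = sample_clip(video, num_frames, s2)
    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)
    return clip1, clip2, label
```

A two-branch encoder can then be trained to classify this relative-speed label. The appearance-focused task described in the abstract would additionally contrast clips drawn from the same video against clips from other videos; that part is omitted from this sketch.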
Related papers
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video
Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z) - Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - Self-supervised and Weakly Supervised Contrastive Learning for
Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an SSL framework that uses an unsupervised method to select clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z) - Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips in different paces and ask a neural network to identify the pace for each video clip.
arXiv Detail & Related papers (2020-08-13T12:40:24Z)