Self-supervised Video Representation Learning by Pace Prediction
- URL: http://arxiv.org/abs/2008.05861v2
- Date: Fri, 4 Sep 2020 08:05:35 GMT
- Title: Self-supervised Video Representation Learning by Pace Prediction
- Authors: Jiangliu Wang, Jianbo Jiao, and Yun-Hui Liu
- Abstract summary: This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips at different paces and ask a neural network to identify the pace of each video clip.
- Score: 48.029602040786685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of self-supervised video representation
learning from a new perspective -- by video pace prediction. It stems from the
observation that the human visual system is sensitive to video pace, e.g., slow
motion, a widely used technique in filmmaking. Specifically, given a video
played at its natural pace, we randomly sample training clips at different paces
and ask a neural network to identify the pace for each video clip. The
assumption here is that the network can only succeed in such a pace reasoning
task when it understands the underlying video content and learns representative
spatio-temporal features. In addition, we further introduce contrastive
learning to push the model towards discriminating different paces by maximizing
the agreement on similar video content. To validate the effectiveness of the
proposed method, we conduct extensive experiments on action recognition and
video retrieval tasks with several alternative network architectures.
Experimental evaluations show that our approach achieves state-of-the-art
performance for self-supervised video representation learning across different
network architectures and different benchmarks. The code and pre-trained models
are available at https://github.com/laura-wang/video-pace.
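To make the pretext task concrete, the following is a minimal PyTorch-style sketch of pace sampling, pace classification, and a simplified content-based contrastive loss. It is an illustration under assumptions, not the authors' released implementation (that is at the GitHub link above): the pace set, the helper names, and the pair-per-video convention in the contrastive loss are all illustrative choices.

```python
# Minimal sketch of the pace-prediction pretext task (illustrative only;
# the authors' code is at https://github.com/laura-wang/video-pace).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

PACES = [1, 2, 4, 8]  # frame-sampling strides: 1 = natural pace, larger = faster


def sample_clip_at_pace(video, clip_len=16):
    """Sample a clip_len-frame clip from `video` (T x C x H x W) at a random pace.

    A pace p keeps every p-th frame, so the clip covers a p-times longer
    window of the source video; the pace index serves as the pretext label.
    Assumes the video has at least clip_len * max(PACES) frames.
    """
    label = random.randrange(len(PACES))
    stride = PACES[label]
    span = clip_len * stride
    start = random.randrange(video.shape[0] - span + 1)
    return video[start:start + span:stride], label


class PacePredictor(nn.Module):
    """Any 3D CNN backbone (e.g., R3D or S3D-G) plus a pace-classification head."""

    def __init__(self, backbone, feat_dim, num_paces=len(PACES)):
        super().__init__()
        self.backbone = backbone  # assumed to map B x C x T x H x W -> B x feat_dim
        self.fc = nn.Linear(feat_dim, num_paces)

    def forward(self, clip):
        return self.fc(self.backbone(clip))


def content_contrastive_loss(z1, z2, tau=0.5):
    """Simplified NT-Xent-style loss: (z1[i], z2[i]) are clips of the same
    video sampled at different paces (positives); all other clips in the
    batch act as negatives, maximizing agreement on similar video content."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                            # B x B cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```

In the paper, the pace-classification term (standard cross-entropy) and the contrastive term are combined into a joint training objective; the exact pace candidates and the weighting between the two losses are specified in the paper and the released code.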
Related papers
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a pair of original and processed videos instead of a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Self-Supervised Video Representation Learning by Video Incoherence Detection [28.540645395066434]
This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It stems from the observation that the human visual system can easily identify video incoherence, based on its comprehensive understanding of videos.
arXiv Detail & Related papers (2021-09-26T04:58:13Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
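In other words, rather than classifying the absolute playback speed of a single clip, the label compares two clips of the same video. Below is a minimal sketch of how such relative-speed labels could be constructed; it is a hypothetical simplification, not RSPNet's actual formulation, and `sample_clip` and `SPEEDS` are illustrative names.

```python
# Hypothetical sketch: label a clip pair by their relative sampling speed.
import random

SPEEDS = [1, 2, 4, 8]


def sample_clip(video, clip_len, stride):
    """Take clip_len frames at the given stride from a random start.

    Assumes `video` (T x C x H x W) has at least clip_len * max(SPEEDS) frames.
    """
    span = clip_len * stride
    start = random.randrange(video.shape[0] - span + 1)
    return video[start:start + span:stride]


def relative_speed_pair(video, clip_len=16):
    """Return two clips and a 3-way label: 0 = first is slower,
    1 = same speed, 2 = first is faster."""
    s1, s2 = random.choice(SPEEDS), random.choice(SPEEDS)
    label = 0 if s1 < s2 else (1 if s1 == s2 else 2)
    return sample_clip(video, clip_len, s1), sample_clip(video, clip_len, s2), label
```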
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Blind Video Temporal Consistency via Deep Video Prior [61.062900556483164]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a pair of original and processed videos.
We show that temporal consistency can be achieved by training a convolutional network on a video with the Deep Video Prior.
arXiv Detail & Related papers (2020-10-22T16:19:20Z)
- Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV learns richer representations and outperforms state-of-the-art self-supervised methods by significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
Because temporal relations are important for videos, we extend the negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representations.
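Intra-negative samples are generated from the anchor video itself, e.g., by repeating or shuffling frames so that temporal structure is broken while appearance is preserved. Below is a minimal frame-shuffling sketch (one simplified variant; it assumes a T x C x H x W clip tensor and is not the paper's exact procedure).

```python
# Sketch of an intra-negative sample: same appearance, broken temporal order.
import torch


def intra_negative(clip):
    """Shuffle the frame order of a clip (T x C x H x W) to produce a
    negative that shares appearance with the anchor but not its dynamics."""
    perm = torch.randperm(clip.shape[0])
    return clip[perm]
```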
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video [23.478555947694108]
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video.
We learn the meta-temporal visual variations in the video by leveraging the variations in the visual appearance according to playback speeds.
We also propose a new layer dependable temporal group normalization method that can be applied to 3D convolutional networks.
arXiv Detail & Related papers (2020-03-05T15:01:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.