Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
- URL: http://arxiv.org/abs/2003.02692v2
- Date: Fri, 21 May 2021 02:39:54 GMT
- Title: Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video
- Authors: Hyeon Cho, Taehoon Kim, Hyung Jin Chang, Wonjun Hwang
- Abstract summary: We propose a self-supervised visual learning method by predicting the variable playback speeds of a video.
We learn the spatio-temporal visual variations in the video by leveraging the variations in the visual appearance according to playback speeds.
We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks.
- Score: 23.478555947694108
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose a self-supervised visual learning method by predicting the
variable playback speeds of a video. Without semantic labels, we learn the
spatio-temporal visual representation of the video by leveraging the variations
in the visual appearance according to different playback speeds under the
assumption of temporal coherence. To learn the spatio-temporal visual
variations in the entire video, we not only predict a single playback speed but also generate clips with various playback speeds and directions from randomized starting points. Hence, the visual representation can be successfully
learned from the meta information (playback speeds and directions) of the
video. We also propose a new layer-dependable temporal group normalization method for 3D convolutional networks that improves representation learning: the temporal features are divided into several groups, and each group is normalized with its own corresponding parameters. We validate the effectiveness of our method by fine-tuning it for action recognition and video retrieval on UCF-101 and HMDB-51.
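A minimal sketch of the two ideas the abstract describes: (1) sampling a clip from a random starting point at a variable playback speed and direction, with the (speed, direction) meta information used as a classification label, and (2) a temporal group normalization that splits the temporal axis of 3D-CNN features into groups and normalizes each group with its own parameters. This is not the authors' released code; names and hyper-parameters such as `speeds`, `num_groups`, and `TemporalGroupNorm` are illustrative assumptions.

```python
# Hedged illustration of variable-playback-speed self-supervision; all
# hyper-parameters and class/function names are assumptions, not the paper's code.
import random
import torch
import torch.nn as nn


def sample_speed_clip(frames, clip_len=16, speeds=(1, 2, 4), allow_reverse=True):
    """Sample a clip of `clip_len` frames from `frames` (T, C, H, W) at a random
    playback speed and direction, returning the clip and its meta-information label."""
    speed_idx = random.randrange(len(speeds))
    speed = speeds[speed_idx]
    reverse = allow_reverse and random.random() < 0.5

    span = (clip_len - 1) * speed + 1                      # frames covered by the clip
    start = random.randrange(frames.shape[0] - span + 1)   # randomized starting point
    idx = torch.arange(start, start + span, speed)
    clip = frames[idx]
    if reverse:                                            # backward playback
        clip = torch.flip(clip, dims=[0])

    # Label encodes the playback meta information: speed index plus direction offset.
    label = speed_idx + (len(speeds) if reverse else 0)
    return clip, label


class TemporalGroupNorm(nn.Module):
    """Normalize 3D-CNN features (N, C, T, H, W) within groups along the temporal
    axis, each group having its own learnable scale and shift."""

    def __init__(self, channels, num_groups=4, eps=1e-5):
        super().__init__()
        self.num_groups = num_groups
        self.eps = eps
        # One (gamma, beta) pair per temporal group.
        self.gamma = nn.Parameter(torch.ones(num_groups, channels, 1, 1, 1))
        self.beta = nn.Parameter(torch.zeros(num_groups, channels, 1, 1, 1))

    def forward(self, x):
        chunks = torch.chunk(x, self.num_groups, dim=2)    # split along time
        out = []
        for g, chunk in enumerate(chunks):
            mean = chunk.mean(dim=(1, 2, 3, 4), keepdim=True)
            var = chunk.var(dim=(1, 2, 3, 4), keepdim=True, unbiased=False)
            normed = (chunk - mean) / torch.sqrt(var + self.eps)
            out.append(normed * self.gamma[g] + self.beta[g])
        return torch.cat(out, dim=2)
```

Under these assumptions, a 3D-CNN backbone would be trained with a standard cross-entropy loss over the 2 x len(speeds) playback classes produced by `sample_speed_clip`.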
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves action classification performance.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimensions inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Time-Equivariant Contrastive Video Representation Learning [47.50766781135863]
We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos.
Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks.
arXiv Detail & Related papers (2021-12-07T10:45:43Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a suitable self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z)
- Self-supervised Video Representation Learning by Pace Prediction [48.029602040786685]
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction.
It stems from the observation that the human visual system is sensitive to video pace.
We randomly sample training clips at different paces and ask a neural network to identify the pace of each video clip.
arXiv Detail & Related papers (2020-08-13T12:40:24Z)
- Video Representation Learning with Visual Tempo Consistency [105.20094164316836]
We show that visual tempo can serve as a self-supervision signal for video representation learning.
We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning.
arXiv Detail & Related papers (2020-06-28T02:46:44Z)
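A brief sketch of what "maximizing the mutual information between representations of slow and fast videos" can look like in practice: a generic InfoNCE-style contrastive objective between slow- and fast-clip embeddings of the same videos. This is a common instantiation of that idea under stated assumptions, not the hierarchical contrastive scheme of the cited paper; `temperature` and the encoder are illustrative.

```python
# Hedged sketch: InfoNCE between slow/fast clip embeddings of the same videos.
import torch
import torch.nn.functional as F


def slow_fast_infonce(slow_emb, fast_emb, temperature=0.07):
    """slow_emb, fast_emb: (N, D) embeddings of the same N videos sampled at a
    slow and a fast tempo. Matching rows are positives, all other rows negatives."""
    slow = F.normalize(slow_emb, dim=1)
    fast = F.normalize(fast_emb, dim=1)
    logits = slow @ fast.t() / temperature                 # (N, N) similarity matrix
    targets = torch.arange(slow.shape[0], device=logits.device)
    # Symmetrized cross-entropy: each clip must retrieve its own counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```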