Time-Equivariant Contrastive Video Representation Learning
- URL: http://arxiv.org/abs/2112.03624v1
- Date: Tue, 7 Dec 2021 10:45:43 GMT
- Title: Time-Equivariant Contrastive Video Representation Learning
- Authors: Simon Jenni and Hailin Jin
- Abstract summary: We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos.
Our experiments show that time-equivariant representations achieve state-of-the-art results in video retrieval and action recognition benchmarks.
- Score: 47.50766781135863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a novel self-supervised contrastive learning method to learn
representations from unlabelled videos. Existing approaches ignore the
specifics of input distortions, e.g., by learning invariance to temporal
transformations. Instead, we argue that video representation should preserve
video dynamics and reflect temporal manipulations of the input. Therefore, we
exploit novel constraints to build representations that are equivariant to
temporal transformations and better capture video dynamics. In our method,
relative temporal transformations between augmented clips of a video are
encoded in a vector and contrasted with other transformation vectors. To
support temporal equivariance learning, we additionally propose the
self-supervised classification of two clips of a video into 1. overlapping,
2. ordered, or 3. unordered. Our experiments show that time-equivariant
representations achieve state-of-the-art results in video retrieval and action
recognition benchmarks on UCF101, HMDB51, and Diving48.
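The abstract states the two pretext signals only at a high level, so the following is a minimal sketch of one way they could be posed, not the authors' implementation: a 3-way relation label for a pair of clips, and an InfoNCE-style contrast over relative-transformation vectors. The function names, the choice of transformation parameters (frame offset and speed ratio), and the small encoder are all assumptions for illustration.

```python
# Hypothetical sketch (not the paper's code) of the two pretext signals, in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


def clip_relation_label(start_a, len_a, start_b, len_b):
    """3-way label for a clip pair: 0 = overlapping, 1 = ordered (A before B),
    2 = unordered (B before A). The exact convention in the paper may differ."""
    end_a, end_b = start_a + len_a, start_b + len_b
    if start_a < end_b and start_b < end_a:  # frame ranges intersect
        return 0
    return 1 if end_a <= start_b else 2


class TransformEncoder(nn.Module):
    """Encode the relative temporal transformation between two augmented clips
    (here assumed to be a frame offset and a speed ratio) as a unit vector."""

    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, offset, speed_ratio):
        params = torch.stack([offset, speed_ratio], dim=-1).float()  # (B, 2)
        return F.normalize(self.net(params), dim=-1)


def transform_contrastive_loss(pred_vec, target_vec, temperature=0.1):
    """InfoNCE over transformation vectors: each predicted vector should match
    the vector of its own clip pair and not those of the other pairs."""
    logits = pred_vec @ target_vec.t() / temperature  # (B, B)
    targets = torch.arange(pred_vec.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

In this reading, pred_vec would be predicted from the pair of clip embeddings and target_vec encoded from the known transformation parameters, so each pair is its own positive and every other pair in the batch acts as a negative.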
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces randomly shuffled frames to have low-confidence outputs (see the shuffled-frame sketch after this list).
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations and, in doing so, also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video [23.478555947694108]
We propose a self-supervised visual learning method by predicting the variable playback speeds of a video (see the playback-speed sketch after this list).
We learn the meta-temporal visual variations in the video by leveraging the variations in the visual appearance according to playback speeds.
We also propose a new layer-dependable temporal group normalization method that can be applied to 3D convolutional networks.
arXiv Detail & Related papers (2020-03-05T15:01:08Z)
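As a rough reading of the "low-confidence on shuffled frames" objective in the Time Is MattEr entry above (a sketch, not the authors' code): shuffle the frame order of a clip and penalise confident predictions on it by pulling the output distribution toward uniform. The function names and the KL-to-uniform formulation are assumptions.

```python
# Hypothetical sketch of a low-confidence objective for temporally shuffled clips.
import torch
import torch.nn.functional as F


def shuffle_frames(video):
    """video: (C, T, H, W). Return the clip with its frames in random order."""
    perm = torch.randperm(video.shape[1])
    return video[:, perm]


def low_confidence_loss(logits):
    """Equivalent (up to a constant) to KL(uniform || softmax(logits)):
    minimised when the model is maximally uncertain about the shuffled clip."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -log_probs.mean(dim=-1).mean()
```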
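For the variable playback-speed entry, a minimal sketch of how such a pretext task could be set up: sub-sample a clip at a randomly chosen stride and train a classifier to recover the stride. The SPEEDS set, the wrap-around indexing, and the backbone/classifier interfaces are placeholders rather than the paper's actual configuration.

```python
# Hypothetical sketch of a playback-speed prediction pretext task.
import torch
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]  # frame-sampling strides used as class labels


def sample_at_speed(video, speed, clip_len=16):
    """video: (C, T, H, W). Take every `speed`-th frame (wrapping around if
    the video is too short) to form a fixed-length clip."""
    idx = torch.arange(0, clip_len * speed, speed) % video.shape[1]
    return video[:, idx]


def speed_prediction_loss(backbone, classifier, video):
    """Sample a random speed, build the corresponding clip, and classify it."""
    label = torch.randint(len(SPEEDS), (1,))
    clip = sample_at_speed(video, SPEEDS[label.item()]).unsqueeze(0)  # (1, C, T', H, W)
    logits = classifier(backbone(clip))  # expected shape: (1, len(SPEEDS))
    return F.cross_entropy(logits, label)
```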
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.