Self-Supervised Video Representation Learning via Latent Time Navigation
- URL: http://arxiv.org/abs/2305.06437v1
- Date: Wed, 10 May 2023 20:06:17 GMT
- Title: Self-Supervised Video Representation Learning via Latent Time Navigation
- Authors: Di Yang, Yaohui Wang, Quan Kong, Antitza Dantcheva, Lorenzo Garattoni,
Gianpiero Francesca, Francois Bremond
- Abstract summary: Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
- Score: 12.721647696921865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised video representation learning has aimed at maximizing
similarity between different temporal segments of one video, in order to enforce
feature persistence over time. This leads to a loss of pertinent information
related to temporal relationships, rendering actions such as `enter' and `leave'
indistinguishable. To mitigate this limitation, we propose Latent Time
Navigation (LTN), a time-parameterized contrastive learning strategy that is
streamlined to capture fine-grained motions. Specifically, we maximize the
representation similarity between different video segments from one video,
while maintaining their representations time-aware along a subspace of the
latent representation code including an orthogonal basis to represent temporal
changes. Our extensive experimental analysis suggests that learning video
representations by LTN consistently improves performance of action
classification in fine-grained and human-oriented tasks (e.g., on Toyota
Smarthome dataset). In addition, we demonstrate that our proposed model, when
pre-trained on Kinetics-400, generalizes well onto the unseen real world video
benchmark datasets UCF101 and HMDB51, achieving state-of-the-art performance in
action recognition.
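To make the idea concrete, here is a minimal, hypothetical sketch of a time-aware contrastive objective: clip embeddings from the same video are pulled together, while a small orthogonal basis spans a temporal subspace along which each embedding is shifted according to its segment's timestamp. The module name (`TimeAwareHead`), the fixed random basis, and the plain InfoNCE loss are our own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeAwareHead(nn.Module):
    """Hypothetical sketch: keep a small orthonormal basis spanning a
    'temporal' subspace of the latent code and shift each clip embedding
    along that subspace according to its timestamp."""

    def __init__(self, dim=256, time_dim=8):
        super().__init__()
        # Random orthonormal basis for the temporal subspace (frozen here).
        basis = torch.linalg.qr(torch.randn(dim, time_dim)).Q  # (dim, time_dim)
        self.register_buffer("basis", basis)
        # Maps a scalar timestamp to coefficients over the temporal basis.
        self.time_mlp = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, time_dim))

    def forward(self, z, t):
        # z: (B, dim) clip embeddings, t: (B, 1) normalized segment timestamps.
        coeff = self.time_mlp(t)          # (B, time_dim)
        return z + coeff @ self.basis.T   # shift along the temporal directions

def info_nce(a, b, temperature=0.1):
    """Standard InfoNCE between two batches of time-aware embeddings."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.T / temperature
    return F.cross_entropy(logits, torch.arange(a.size(0)))

# Toy usage: two segments from the same videos, sampled at different times.
encoder = nn.Linear(1024, 256)  # stand-in for a video backbone
head = TimeAwareHead()
feat1, feat2 = torch.randn(4, 1024), torch.randn(4, 1024)
t1, t2 = torch.rand(4, 1), torch.rand(4, 1)
loss = info_nce(head(encoder(feat1), t1), head(encoder(feat2), t2))
```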
Related papers
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video
Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
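A rough sketch of the dual-teacher distillation idea follows; the function name and the fixed mixing weight are our own simplifications (the paper balances the two teachers dynamically per video).

```python
import torch
import torch.nn.functional as F

def dual_teacher_distillation(student_logits, inv_teacher_logits,
                              dist_teacher_logits, weight=0.5, T=2.0):
    """Hypothetical sketch: distill a student from a temporally-invariant
    and a temporally-distinctive teacher on unlabeled clips. A fixed
    `weight` is used here; the paper reweights the teachers per video."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_inv = F.softmax(inv_teacher_logits / T, dim=1)
    p_dist = F.softmax(dist_teacher_logits / T, dim=1)
    kl_inv = F.kl_div(log_p_student, p_inv, reduction="batchmean")
    kl_dist = F.kl_div(log_p_student, p_dist, reduction="batchmean")
    return weight * kl_inv + (1.0 - weight) * kl_dist

# Toy usage with random logits for a batch of 8 clips and 10 classes.
loss = dual_teacher_distillation(torch.randn(8, 10), torch.randn(8, 10), torch.randn(8, 10))
```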
arXiv Detail & Related papers (2023-03-28T19:28:54Z) - TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
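The sketch below shows the generic snippet-order-prediction pretext task behind this kind of order-based supervision, not the graph-based TCGL model or the exact ASOP module; all names and shapes are ours.

```python
import itertools
import random
import torch
import torch.nn as nn

# Enumerate all orderings of 3 snippets; the pretext task is to predict
# which permutation was applied to a video's snippets.
PERMS = list(itertools.permutations(range(3)))

class OrderPredictor(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.classifier = nn.Linear(3 * feat_dim, len(PERMS))

    def forward(self, snippet_feats):
        # snippet_feats: (B, 3, feat_dim), already shuffled.
        return self.classifier(snippet_feats.flatten(1))

def shuffle_snippets(snippet_feats):
    """Apply a random permutation to each sample and return its label."""
    labels = torch.tensor([random.randrange(len(PERMS)) for _ in range(snippet_feats.size(0))])
    shuffled = torch.stack([snippet_feats[i, list(PERMS[int(l)])] for i, l in enumerate(labels)])
    return shuffled, labels

# Toy usage: features of 3 snippets per video for a batch of 4 videos.
feats = torch.randn(4, 3, 256)
shuffled, labels = shuffle_snippets(feats)
loss = nn.CrossEntropyLoss()(OrderPredictor()(shuffled), labels)
```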
arXiv Detail & Related papers (2021-12-07T09:27:56Z) - Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks.
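The temporal-gradient modality amounts to differences between adjacent frames; a minimal sketch of computing it from a clip tensor (shapes are illustrative):

```python
import torch

def temporal_gradient(clip):
    """clip: (C, T, H, W) float tensor of frames.
    Returns frame-to-frame differences, a simple stand-in for the
    temporal-gradient modality (RGB difference between adjacent frames)."""
    return clip[:, 1:] - clip[:, :-1]  # (C, T-1, H, W)

# Toy usage: an 8-frame RGB clip at 112x112 resolution.
clip = torch.rand(3, 8, 112, 112)
tg = temporal_gradient(clip)  # shape (3, 7, 112, 112)
```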
arXiv Detail & Related papers (2021-11-25T20:30:30Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
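One generic way to relax a single fixed temporal kernel size is to run several temporal extents in parallel and fuse them; the block below is a hypothetical illustration of that idea, not the paper's actual operator.

```python
import torch
import torch.nn as nn

class MultiTemporalKernelBlock(nn.Module):
    """Hypothetical sketch: apply temporal convolutions with several kernel
    sizes in parallel and average the results, as one generic way to move
    beyond a single fixed temporal kernel size."""

    def __init__(self, channels=64, temporal_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in temporal_sizes
        )

    def forward(self, x):
        # x: (B, C, T, H, W)
        return torch.stack([branch(x) for branch in self.branches]).mean(0)

# Toy usage on a small feature volume.
y = MultiTemporalKernelBlock()(torch.randn(2, 64, 16, 7, 7))
```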
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
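A minimal, generic form of such a positive-pair consistency term is a cosine-similarity loss between two views of the same video; the sketch below is our own simplification, not ASCNet's exact appearance/speed formulation.

```python
import torch
import torch.nn.functional as F

def positive_consistency_loss(z1, z2):
    """Pull two positive views of the same video together by maximizing
    their cosine similarity (a generic consistency term)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return (1.0 - (z1 * z2).sum(dim=1)).mean()

# Toy usage with two augmented views of a batch of 8 clips.
loss = positive_consistency_loss(torch.randn(8, 128), torch.randn(8, 128))
```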
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
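One way to make a contrastive head "augmentation aware" is to feed it a vector of augmentation parameters alongside the clip features; the sketch below illustrates that conditioning with hypothetical names and dimensions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AugmentationAwareProjector(nn.Module):
    """Hypothetical sketch: condition the projection head on a vector of
    augmentation parameters (e.g. temporal shift, crop scale) so the
    contrastive target is aware of how the two views were produced."""

    def __init__(self, feat_dim=512, aug_dim=4, out_dim=128):
        super().__init__()
        self.aug_embed = nn.Linear(aug_dim, 64)
        self.proj = nn.Sequential(nn.Linear(feat_dim + 64, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, feats, aug_params):
        # feats: (B, feat_dim) clip features, aug_params: (B, aug_dim).
        return self.proj(torch.cat([feats, self.aug_embed(aug_params)], dim=1))

# Toy usage: project features of one view given its augmentation parameters.
z = AugmentationAwareProjector()(torch.randn(8, 512), torch.rand(8, 4))
```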
arXiv Detail & Related papers (2021-04-01T16:48:53Z) - Self-supervised Temporal Discriminative Learning for Video
Representation Learning [39.43942923911425]
Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework in a self-supervised manner.
arXiv Detail & Related papers (2020-08-05T13:36:59Z) - Video Representation Learning with Visual Tempo Consistency [105.20094164316836]
We show that visual tempo can serve as a self-supervision signal for video representation learning.
We propose to maximize the mutual information between representations of slow and fast videos via hierarchical contrastive learning.
arXiv Detail & Related papers (2020-06-28T02:46:44Z)