Self-supervised Temporal Discriminative Learning for Video
Representation Learning
- URL: http://arxiv.org/abs/2008.02129v1
- Date: Wed, 5 Aug 2020 13:36:59 GMT
- Title: Self-supervised Temporal Discriminative Learning for Video
Representation Learning
- Authors: Jinpeng Wang, Yiqi Lin, Andy J. Ma, Pong C. Yuen
- Abstract summary: Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework in a self-supervised manner.
- Score: 39.43942923911425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal cues in videos provide important information for recognizing actions
accurately. However, temporal-discriminative features can hardly be extracted
without using an annotated large-scale video action dataset for training. This
paper proposes a novel Video-based Temporal-Discriminative Learning (VTDL)
framework in a self-supervised manner. Without labelled data for network
pretraining, a temporal triplet is generated for each anchor video by using
segments of the same or different time intervals so as to enhance the capacity
for temporal feature representation. Measuring temporal information by the time
derivative, Temporal Consistent Augmentation (TCA) is designed to ensure that
the time derivative (in any order) of the augmented positive is invariant
except for a scaling constant. Finally, temporal-discriminative features are
learnt by minimizing the distance between each anchor and its augmented
positive, while the distances between each anchor and its augmented negative,
as well as other videos saved in the memory bank, are maximized to enrich
representation diversity. In the downstream action recognition task, the
proposed method significantly outperforms existing related works. Surprisingly,
the proposed self-supervised approach is better than fully-supervised methods
on UCF101 and HMDB51 when a small-scale video dataset (with only thousands of
videos) is used for pre-training. The code has been made publicly available at
https://github.com/FingerRec/Self-Supervised-Temporal-Discriminative-Representation-Learning-for-Video-Action-Recognition.
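
The abstract describes the training objective only at a high level. The following is a minimal, illustrative PyTorch sketch of how a temporal-triplet contrastive objective with a memory bank of this kind might look; the speed_augment helper, the exact loss formulation, the temperature and all other details are assumptions made for illustration, not the authors' implementation (see the linked repository for that).

import torch
import torch.nn.functional as F

def speed_augment(clip, rate=2):
    # clip: (C, T, H, W). Subsampling frames changes the playback speed, so the
    # discrete time derivative of the signal is rescaled by a constant factor --
    # the kind of invariance-up-to-scaling that TCA is described as targeting.
    return clip[:, ::rate]

def vtdl_style_loss(anchor, positive, negative, memory_bank, temperature=0.07):
    # anchor, positive, negative: (B, D) embeddings of an anchor clip, its
    # temporally consistent augmentation, and a clip taken from a different
    # time interval of the same video. memory_bank: (K, D) embeddings of
    # other videos. All names and the temperature value are illustrative.
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    bank = F.normalize(memory_bank, dim=1)

    pos_logit = (a * p).sum(dim=1, keepdim=True)   # (B, 1): pulled towards the anchor
    neg_logit = (a * n).sum(dim=1, keepdim=True)   # (B, 1): pushed away from the anchor
    bank_logits = a @ bank.t()                     # (B, K): other videos, also pushed away

    logits = torch.cat([pos_logit, neg_logit, bank_logits], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)        # the positive occupies class 0

In a full training loop, the anchor, positive and negative embeddings would come from a shared video encoder, and the memory bank would be refreshed with encoder outputs from previous batches.
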
Related papers
- Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences.
We introduce the novel Local-Alignment Contrastive (LAC) loss, which incorporates a differentiable local alignment loss to capture local temporal dependencies.
We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z)
- An Empirical Study of End-to-End Temporal Action Detection [82.64373812690127]
Temporal action detection (TAD) is an important yet challenging task in video understanding.
Rather than end-to-end learning, most existing methods adopt a head-only learning paradigm.
We validate the advantage of end-to-end learning over head-only learning and observe up to 11% performance improvement.
arXiv Detail & Related papers (2022-04-06T16:46:30Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives (a minimal sketch of this generic paradigm is given after this list).
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about the specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- TCLR: Temporal Contrastive Learning for Video Representation [49.6637562402604]
We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods.
With the commonly used 3D-ResNet-18 architecture, we achieve 82.4% (+5.1% increase over the previous best) top-1 accuracy on UCF101 and 52.9% (+5.4% increase) on HMDB51 action classification.
arXiv Detail & Related papers (2021-01-20T05:38:16Z)
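
The Composable Augmentation Encoding entry above refers to the common instance-discrimination paradigm of contrastive learning: two augmented views of the same clip form a positive pair, while other clips in the batch serve as negatives. As a point of reference, here is a minimal, self-contained sketch of that generic paradigm (an InfoNCE-style loss); it is not the method of any particular paper listed above, and the function name and temperature are illustrative assumptions.

import torch
import torch.nn.functional as F

def instance_contrastive_loss(view1, view2, temperature=0.1):
    # view1, view2: (B, D) embeddings of two augmented views of the same B clips.
    z1 = F.normalize(view1, dim=1)
    z2 = F.normalize(view2, dim=1)
    logits = z1 @ z2.t() / temperature                     # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)   # matching indices are positives
    # Diagonal entries are the positive pairs; off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)
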
This list is automatically generated from the titles and abstracts of the papers on this site.