Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation
- URL: http://arxiv.org/abs/2112.08913v2
- Date: Sun, 19 Dec 2021 14:17:12 GMT
- Title: Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation
- Authors: Yujia Zhang, Lai-Man Po, Xuyuan Xu, Mengyang Liu, Yexin Wang, Weifeng Ou, Yuzhi Zhao, Wing-Yin Yu
- Abstract summary: We propose a novel pretext task: spatio-temporal overlap rate (STOR) prediction.
It stems from the observation that humans are capable of discriminating the overlap rates of videos in space and time.
We employ a joint optimization combining pretext tasks with contrastive learning to further enhance spatio-temporal representation learning.
- Score: 16.643709221279764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatio-temporal representation learning is critical for video self-supervised
representation. Recent approaches mainly use contrastive learning and pretext
tasks. However, these approaches learn representation by discriminating sampled
instances via feature similarity in the latent space while ignoring the
intermediate state of the learned representations, which limits the overall
performance. In this work, taking into account the degree of similarity of
sampled instances as the intermediate state, we propose a novel pretext task -
spatio-temporal overlap rate (STOR) prediction. It stems from the observation
that humans are capable of discriminating the overlap rates of videos in space
and time. This task encourages the model to discriminate the STOR of two
generated samples to learn the representations. Moreover, we employ a joint
optimization combining pretext tasks with contrastive learning to further
enhance the spatio-temporal representation learning. We also study the mutual
influence of each component in the proposed scheme. Extensive experiments
demonstrate that our proposed STOR task can favor both contrastive learning and
pretext tasks. The joint optimization scheme can significantly improve the
spatio-temporal representation in video understanding. The code is available at
https://github.com/Katou2/CSTP.
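As a concrete illustration of the pretext label, here is a minimal sketch of how a spatio-temporal overlap rate could be computed for two clips cropped from the same video. The `Clip` structure, the per-axis overlap definitions, and combining them by multiplication are assumptions made for illustration; the authors' exact formulation is in the repository above.

```python
# A minimal sketch, under assumptions, of a spatio-temporal overlap rate
# (STOR) label for two clips cropped from the same video. The Clip fields
# and per-axis overlap definitions below are illustrative, not the
# authors' exact formulation.
from dataclasses import dataclass

@dataclass
class Clip:
    t0: int  # first frame index (inclusive)
    t1: int  # last frame index (exclusive)
    x0: int  # spatial crop box, in pixels
    y0: int
    x1: int
    y1: int

def temporal_overlap(a: Clip, b: Clip) -> float:
    """Shared frames as a fraction of the shorter clip's length."""
    shared = max(0, min(a.t1, b.t1) - max(a.t0, b.t0))
    return shared / min(a.t1 - a.t0, b.t1 - b.t0)

def spatial_overlap(a: Clip, b: Clip) -> float:
    """Intersection area as a fraction of the smaller crop's area."""
    iw = max(0, min(a.x1, b.x1) - max(a.x0, b.x0))
    ih = max(0, min(a.y1, b.y1) - max(a.y0, b.y0))
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    return (iw * ih) / min(area_a, area_b)

def stor(a: Clip, b: Clip) -> float:
    """Combined spatio-temporal overlap rate in [0, 1]."""
    return temporal_overlap(a, b) * spatial_overlap(a, b)

# Two 16-frame clips cropped from a 64-frame, 224x224 video.
a = Clip(t0=0, t1=16, x0=0, y0=0, x1=160, y1=160)
b = Clip(t0=8, t1=24, x0=80, y0=80, x1=224, y1=224)
print(f"temporal={temporal_overlap(a, b):.2f}",
      f"spatial={spatial_overlap(a, b):.2f}",
      f"stor={stor(a, b):.2f}")
```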
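The abstract also describes jointly optimizing the pretext task with contrastive learning. The sketch below combines a standard InfoNCE term with a STOR prediction term; the weight `lam`, the discretization of STOR into class bins, and the head producing `stor_logits` are hypothetical choices for illustration, not the authors' exact recipe.

```python
# A hedged sketch of a joint objective: an InfoNCE contrastive term plus
# a term that predicts the STOR of the two generated clips. Loss weight,
# STOR discretization, and the prediction head are assumptions; see the
# repository above for the real recipe.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE: clips from the same video are positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau  # (B, B) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def joint_loss(z1, z2, stor_logits, stor_bins, lam: float = 1.0) -> torch.Tensor:
    """Contrastive term + STOR prediction over discretized overlap rates."""
    return info_nce(z1, z2) + lam * F.cross_entropy(stor_logits, stor_bins)

# Toy shapes: batch of 8 clip pairs, 128-d embeddings, 5 STOR bins.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
stor_logits = torch.randn(8, 5)          # from a prediction head (assumed)
stor_bins = torch.randint(0, 5, (8,))    # discretized stor() labels
print(joint_loss(z1, z2, stor_logits, stor_bins))
```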
Related papers
- A Dual Approach to Imitation Learning from Observations with Offline Datasets [19.856363985916644]
Demonstrations are an effective alternative to task specification for learning agents in settings where designing a reward function is difficult.
We derive DILO, an algorithm that can leverage arbitrary suboptimal data to learn imitating policies without requiring expert actions.
arXiv Detail & Related papers (2024-06-13T04:39:42Z)
- Towards a Better Understanding of Representation Dynamics under TD-learning [23.65188248947536]
TD-learning is a foundational reinforcement learning (RL) algorithm for value prediction.
In this work, we consider the question: how does end-to-end TD-learning impact the representation over time?
We first show that when the environments are reversible, end-to-end TD-learning strictly decreases the value approximation error over time.
arXiv Detail & Related papers (2023-05-29T13:34:40Z)
- TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z)
- Multi-Task Self-Supervised Time-Series Representation Learning [3.31490164885582]
Time-series representation learning can extract representations from data with temporal dynamics and sparse labels.
We propose a new time-series representation learning method by combining the advantages of self-supervised tasks.
We evaluate the proposed framework on three downstream tasks: time-series classification, forecasting, and anomaly detection.
arXiv Detail & Related papers (2023-03-02T07:44:06Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Time-Series Representation Learning via Temporal and Contextual Contrasting [14.688033556422337]
We propose an unsupervised Time-Series representation learning framework via Temporal and Contextual Contrasting (TS-TCC).
First, the raw time-series data are transformed into two different yet correlated views by using weak and strong augmentations.
Second, we propose a novel temporal contrasting module to learn robust temporal representations by designing a tough cross-view prediction task.
Third, to further learn discriminative representations, we propose a contextual contrasting module built upon the contexts from the temporal contrasting module.
arXiv Detail & Related papers (2021-06-26T23:56:31Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learn robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations.
We show that our method encodes valuable information about specified spatial or temporal augmentations, and in doing so also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
- Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning [6.523119805288132]
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding.
arXiv Detail & Related papers (2020-11-23T08:05:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.