Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised
Video Representation Learning
- URL: http://arxiv.org/abs/2011.11261v2
- Date: Tue, 31 Aug 2021 20:46:37 GMT
- Title: Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised
Video Representation Learning
- Authors: Zehua Zhang and David Crandall
- Abstract summary: We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks emphasizing spatial and temporal features respectively, and (b) performing them hierarchically to encourage multi-scale understanding.
- Score: 6.523119805288132
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a novel technique for self-supervised video representation
learning that (a) decouples the learning objective into two contrastive
subtasks emphasizing spatial and temporal features respectively, and (b)
performs them hierarchically to encourage multi-scale understanding. Motivated
by their effectiveness in supervised learning, we first introduce
spatial-temporal feature learning decoupling and hierarchical learning to the
context of unsupervised video learning. We show experimentally that
augmentations can be manipulated as a form of regularization that guides the
network to learn desired semantics in contrastive learning, and we propose a
way for the model to capture spatial and temporal features separately at
multiple scales. We also introduce an approach to overcome the problem of
divergent levels of instance invariance at different hierarchy levels by
modeling the invariance as loss weights for objective re-weighting.
Experiments on the downstream action recognition benchmarks UCF101 and HMDB51
show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast
(HDC) substantially improves over directly learning spatial-temporal features
as a whole, and achieves competitive performance compared with other
state-of-the-art unsupervised methods. Code will be made available.
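As a rough illustration of the decoupling idea, here is a minimal sketch (not the authors' released code): the backbone, the two projection heads, and the scalar weights `w_s` and `w_t` standing in for the paper's invariance-based re-weighting are all assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, temperature=0.1):
    """Standard InfoNCE: row i of q matches row i of k; all other rows
    in the batch serve as negatives."""
    q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                  # (B, B) similarities
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def hdc_style_loss(backbone, head_s, head_t, view_a, view_b,
                   w_s=1.0, w_t=1.0):
    """Decoupled spatial/temporal contrast with per-subtask re-weighting.

    view_a / view_b: two augmented clips of the same videos, with the
    augmentations chosen so the spatial subtask must rely on appearance
    and the temporal subtask on motion (the paper manipulates
    augmentations as regularization to steer each subtask).
    """
    za, zb = backbone(view_a), backbone(view_b)       # (B, D) clip features
    loss_s = info_nce(head_s(za), head_s(zb))         # spatial subtask
    loss_t = info_nce(head_t(za), head_t(zb))         # temporal subtask
    return w_s * loss_s + w_t * loss_t                # objective re-weighting
```

In the full method, a loss of this form would be applied at multiple hierarchy levels of the backbone, with the weights set per level from the estimated instance invariance.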
Related papers
- MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning [8.61492882526007]
In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges to sample efficiency.
We introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking.
Our evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency.
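A minimal sketch of the masking-plus-temporal-contrast idea follows; the random cube masking is a crude stand-in for MOOSS's graph-based spatial-temporal masking, and the encoder interface is assumed.

```python
import torch
import torch.nn.functional as F

def cube_mask(obs, drop_prob=0.5, patch=8):
    """Zero out random spatial patches independently per frame.
    obs: (B, T, C, H, W) pixel observations."""
    B, T, C, H, W = obs.shape
    keep = (torch.rand(B * T, 1, H // patch, W // patch,
                       device=obs.device) > drop_prob).float()
    keep = F.interpolate(keep, size=(H, W), mode="nearest")
    return obs * keep.view(B, T, 1, H, W)

def masked_temporal_contrast(encoder, obs, temperature=0.1):
    """Contrast the masked view of each state against its unmasked view;
    other states in the batch act as negatives."""
    z1 = F.normalize(encoder(cube_mask(obs)), dim=1)   # (B, D)
    z2 = F.normalize(encoder(obs), dim=1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(obs.size(0), device=obs.device)
    return F.cross_entropy(logits, labels)
```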
arXiv Detail & Related papers (2024-09-02T18:57:53Z)
- Visually Robust Adversarial Imitation Learning from Videos with Contrastive Learning [9.240917262195046]
C-LAIfO is a computationally efficient algorithm designed for imitation learning from videos.
We analyze the problem of imitation from expert videos that exhibit visual discrepancies, and use contrastive learning to obtain a latent space that is invariant to them.
Our algorithm performs imitation entirely within this latent space using off-policy adversarial imitation learning.
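A hedged sketch of the adversarial-imitation-in-latent-space step; the discriminator architecture and the GAIL-style reward are generic choices, not necessarily C-LAIfO's exact ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentDiscriminator(nn.Module):
    """Classifies latent transitions (z_t, z_{t+1}) as expert vs. agent."""
    def __init__(self, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z, z_next):
        return self.net(torch.cat([z, z_next], dim=-1)).squeeze(-1)

def imitation_reward(disc, z, z_next):
    """GAIL-style reward -log(1 - D): large when a transition in the
    learned latent space looks expert-like."""
    with torch.no_grad():
        return -F.logsigmoid(-disc(z, z_next))
```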
arXiv Detail & Related papers (2024-06-18T20:56:18Z)
- A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
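One generic form such a latent variable model could take is sketched below; this specific factorization is an assumption for illustration, not the paper's exact model.

```latex
% Latent z generates two augmented views of the same datum; an SSL
% encoder approximates the posterior over z.
\begin{align*}
  z &\sim p(z), \qquad x_1, x_2 \sim p_\theta(x \mid z)
      \quad \text{(two views of one latent)},\\
  q_\phi(z \mid x) &\approx p_\theta(z \mid x)
      \quad \text{(representation as inferred latent)}.
\end{align*}
```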
arXiv Detail & Related papers (2024-02-02T13:31:17Z)
- Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos [71.20376514273367]
We propose a unified point cloud video self-supervised learning framework for object-centric and scene-centric data.
Our method outperforms supervised counterparts on a wide range of downstream tasks.
arXiv Detail & Related papers (2023-08-18T02:17:47Z)
- Learning Appearance-motion Normality for Video Anomaly Detection [11.658792932975652]
We propose a two-stream auto-encoder framework augmented with spatial-temporal memories.
It learns appearance normality and motion normality independently and explores their correlations via adversarial learning.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on UCSD Ped2 and CUHK Avenue datasets.
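A minimal sketch of the two-stream idea follows; the tiny encoder/decoder, the frame-difference motion proxy, and the score are assumptions, and the memories and adversarial correlation learning are omitted.

```python
import torch
import torch.nn as nn

def tiny_autoencoder():
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # encode
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1))     # decode

class TwoStreamAE(nn.Module):
    """One stream reconstructs appearance (a frame), the other motion
    (a frame difference); high reconstruction error signals anomaly."""
    def __init__(self):
        super().__init__()
        self.appearance = tiny_autoencoder()
        self.motion = tiny_autoencoder()

    def forward(self, frames):               # frames: (B, T, 3, H, W)
        app = frames[:, -1]                  # appearance target
        mot = frames[:, -1] - frames[:, -2]  # crude motion proxy
        return self.appearance(app), self.motion(mot), app, mot

def anomaly_score(model, frames):
    ra, rm, app, mot = model(frames)
    return ((ra - app) ** 2).mean(dim=(1, 2, 3)) + \
           ((rm - mot) ** 2).mean(dim=(1, 2, 3))
```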
arXiv Detail & Related papers (2022-07-27T08:30:19Z)
- Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning [45.13060970066485]
We propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS).
Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance.
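A sketch of the hierarchical encoding pattern; dimensions, depth, and the mean-pooling between levels are illustrative assumptions, not Hi-TRS's exact design.

```python
import torch
import torch.nn as nn

class HierarchicalSkeletonEncoder(nn.Module):
    """Spatial Transformer over joints within each frame, then a
    temporal Transformer over the per-frame tokens."""
    def __init__(self, dim=64, nhead=4, depth=2):
        super().__init__()
        make = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True), depth)
        self.embed = nn.Linear(3, dim)        # (x, y, z) joint coordinates
        self.spatial, self.temporal = make(), make()

    def forward(self, skel):                  # skel: (B, T, J, 3)
        B, T, J, _ = skel.shape
        x = self.embed(skel).view(B * T, J, -1)
        x = self.spatial(x).mean(dim=1)       # one token per frame
        x = self.temporal(x.view(B, T, -1))
        return x.mean(dim=1)                  # clip-level embedding
```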
arXiv Detail & Related papers (2022-07-20T04:21:05Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
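To make the LCS half concrete, here is one way to smooth the classic LCS recurrence so it becomes differentiable; this is a sketch of the general technique, and the paper's exact formulation may differ.

```python
import torch

def soft_lcs(sim, tau=0.1):
    """Differentiable Longest-Common-Subsequence score.

    sim: (n, m) pairwise similarity matrix between two sequences.
    The hard max of the classic DP recurrence is replaced with a
    temperature-controlled log-sum-exp so gradients can flow.
    """
    n, m = sim.shape
    L = sim.new_zeros(n + 1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = torch.stack((
                L[i - 1, j - 1] + sim[i - 1, j - 1],  # align i with j
                L[i - 1, j],                          # skip element of seq 1
                L[i, j - 1]))                         # skip element of seq 2
            L[i, j] = tau * torch.logsumexp(candidates / tau, dim=0)
    return L[n, m]
```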
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively so as to produce a representation that emphasizes the novel information in the frame at the current time-stamp.
SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
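One simple alternative to a fixed temporal kernel size is a bank of parallel temporal convolutions, as in the generic sketch below (not necessarily the paper's construction).

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Parallel temporal kernel sizes, concatenated channel-wise, so the
    block responds to both fast and slow temporal variations."""
    def __init__(self, c_in, c_out, sizes=(1, 3, 5)):
        super().__init__()
        assert c_out % len(sizes) == 0
        self.branches = nn.ModuleList(
            nn.Conv3d(c_in, c_out // len(sizes), kernel_size=(k, 1, 1),
                      padding=(k // 2, 0, 0)) for k in sizes)

    def forward(self, x):                     # x: (B, C, T, H, W)
        return torch.cat([b(x) for b in self.branches], dim=1)
```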
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition [15.701647552427708]
We present a novel way to obtain the surrogate supervision signal based on high-level feature maps under consistency regularization.
Our method achieves substantial improvements compared with state-of-the-art self-supervised learning methods for action recognition.
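A minimal sketch of consistency regularization over feature maps; the cosine-distance form and the `augment` callable are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, clip, augment):
    """Encourage the high-level feature maps of two random
    spatio-temporal augmentations of one clip to agree."""
    f1 = model(augment(clip)).flatten(1)      # (B, C*T'*H'*W')
    f2 = model(augment(clip)).flatten(1)      # second random augmentation
    f1, f2 = F.normalize(f1, dim=1), F.normalize(f2, dim=1)
    return (1 - (f1 * f2).sum(dim=1)).mean()  # mean cosine distance
```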
arXiv Detail & Related papers (2020-08-05T12:41:59Z)
- Self-supervised Video Object Segmentation [76.83567326586162]
The objective of this paper is self-supervised representation learning, with the goal of solving semi-supervised video object segmentation (a.k.a. dense tracking).
We make the following contributions: (i) we propose to improve the existing self-supervised approach with a simple yet more effective memory mechanism for long-term correspondence matching; (ii) by augmenting the self-supervised approach with an online adaptation module, our method successfully alleviates tracker drifts caused by spatial-temporal discontinuity; (iii) we demonstrate state-of-the-art results among the self-supervised approaches on DAVIS-2017 and YouTube-VOS.
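A sketch of the attention-based memory readout commonly used for this kind of long-term correspondence matching; shapes and temperature are assumptions, and the paper's memory mechanism involves more than this.

```python
import torch
import torch.nn.functional as F

def memory_readout(query_feat, mem_feats, mem_masks, temperature=0.07):
    """Propagate mask labels from memory frames to the query frame.

    query_feat: (B, C, H, W)    features of the current frame
    mem_feats:  (B, C, M, H, W) features of M memory frames
    mem_masks:  (B, K, M, H, W) K-class soft masks of memory frames
    """
    B, C, H, W = query_feat.shape
    q = F.normalize(query_feat.flatten(2), dim=1)      # (B, C, HW)
    k = F.normalize(mem_feats.flatten(2), dim=1)       # (B, C, M*HW)
    attn = torch.softmax(q.transpose(1, 2) @ k / temperature, dim=-1)
    v = mem_masks.flatten(2).transpose(1, 2)           # (B, M*HW, K)
    out = attn @ v                                     # (B, HW, K)
    return out.transpose(1, 2).view(B, -1, H, W)       # (B, K, H, W)
```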
arXiv Detail & Related papers (2020-06-22T17:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.