TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning
- URL: http://arxiv.org/abs/2112.03587v1
- Date: Tue, 7 Dec 2021 09:27:56 GMT
- Title: TCGL: Temporal Contrastive Graph for Self-supervised Video
Representation Learning
- Authors: Yang Liu, Keze Wang, Lingbo Liu, Haoyuan Lan, Liang Lin
- Abstract summary: We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
- Score: 79.77010271213695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video self-supervised learning is a challenging task, which requires
significant expressive power from the model to leverage rich spatial-temporal
knowledge and generate effective supervisory signals from large amounts of
unlabeled videos. However, existing methods fail to increase the temporal
diversity of unlabeled videos and neglect to explicitly model multi-scale
temporal dependencies. To overcome these limitations, we take advantage of
the multi-scale temporal dependencies within videos and propose a novel
video self-supervised learning framework named Temporal Contrastive Graph
Learning (TCGL), which jointly models the inter-snippet and intra-snippet
temporal dependencies for temporal representation learning with a hybrid
graph contrastive learning strategy. Specifically, a Spatial-Temporal
Knowledge Discovering (STKD) module is first introduced to extract
motion-enhanced spatial-temporal representations from videos based on the
frequency-domain analysis of the discrete cosine transform. To explicitly
model multi-scale temporal dependencies of unlabeled videos, our TCGL
integrates prior knowledge about frame and snippet orders into graph
structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
Then, specific
contrastive learning modules are designed to maximize the agreement between
nodes in different graph views. To generate supervisory signals for unlabeled
videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module which
leverages the relational knowledge among video snippets to learn the global
context representation and recalibrate the channel-wise features adaptively.
Experimental results demonstrate the superiority of our TCGL over the
state-of-the-art methods on large-scale action recognition and video retrieval
benchmarks.
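Two mechanisms named in the abstract are easy to illustrate. First, a minimal sketch of frequency-domain motion enhancement in the spirit of the STKD module: take a DCT along the temporal axis, attenuate the lowest (static) frequency component, and invert. The `keep_dc` weighting is an illustrative assumption, not the paper's setting.

```python
import numpy as np
from scipy.fft import dct, idct

def enhance_motion(clip: np.ndarray, keep_dc: float = 0.1) -> np.ndarray:
    """clip: (T, H, W, C) float array; returns a motion-enhanced clip where
    the static (DC) temporal component is suppressed."""
    coeffs = dct(clip, axis=0, norm="ortho")  # per-pixel DCT along time
    coeffs[0] *= keep_dc                      # damp the static background
    return idct(coeffs, axis=0, norm="ortho")

clip = np.random.rand(16, 112, 112, 3).astype(np.float32)
enhanced = enhance_motion(clip)
```

Second, an objective that "maximizes the agreement between nodes in different graph views" is, in spirit, a node-level NT-Xent (InfoNCE) loss. The encoder, graph augmentations, and temperature below are assumptions; this is a sketch, not the released TCGL code.

```python
import torch
import torch.nn.functional as F

def node_info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of the same N nodes under two graph views.
    Pulls matching nodes together, pushes every other node pair apart."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                 # (N, N) scaled cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrized cross-entropy: view1 -> view2 and view2 -> view1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Usage with any graph encoder over the intra-/inter-snippet graphs:
# loss = node_info_nce(encoder(graph_view_1), encoder(graph_view_2))
```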
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations via LTN consistently improves the performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from the internal representations of video models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z) - Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in the TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions; a frame-difference sketch in this spirit follows this list.
The ISM simultaneously utilizes the spatial information from the SIM and the temporal information from the TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z) - Enhancing Self-supervised Video Representation Learning via Multi-level
Feature Optimization [30.670109727802494]
This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
arXiv Detail & Related papers (2021-08-04T17:16:18Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Temporal Contrastive Graph Learning for Video Action Recognition and
Retrieval [83.56444443849679]
This work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL).
Our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly regards the inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning.
Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
arXiv Detail & Related papers (2021-01-04T08:11:39Z) - Temporal Relational Modeling with Self-Supervision for Action
Segmentation [38.62057004624234]
We introduce the Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in videos.
In particular, we capture and model temporal relations by constructing multi-level dilated temporal graphs; a toy construction of such graphs is sketched after this list.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
arXiv Detail & Related papers (2020-12-14T13:41:28Z)