Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval
- URL: http://arxiv.org/abs/2101.00820v8
- Date: Wed, 17 Mar 2021 03:32:52 GMT
- Title: Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval
- Authors: Yang Liu, Keze Wang, Haoyuan Lan, Liang Lin
- Abstract summary: This work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL)
Our TCGL is rooted in a hybrid graph contrastive learning strategy that jointly regards the inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning.
Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
- Score: 83.56444443849679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attempting to fully discover the temporal diversity and
chronological characteristics for self-supervised video representation
learning, this work takes advantage of the temporal dependencies within
videos and proposes a novel self-supervised method named Temporal
Contrastive Graph Learning (TCGL). In contrast to existing methods that
ignore elaborate temporal dependency modeling, our TCGL is rooted in a
hybrid graph contrastive learning strategy that jointly regards the
inter-snippet and intra-snippet temporal dependencies as self-supervision
signals for temporal representation learning.
To model multi-scale temporal dependencies, our TCGL integrates the prior
knowledge about the frame and snippet orders into graph structures, i.e.,
the intra-/inter-snippet temporal contrastive graphs. By randomly removing
edges
and masking nodes of the intra-snippet graphs or inter-snippet graphs, our TCGL
can generate different correlated graph views. Then, specific contrastive
learning modules are designed to maximize the agreement between nodes in
different views. To adaptively learn the global context representation and
recalibrate the channel-wise features, we introduce an adaptive video snippet
order prediction module, which leverages the relational knowledge among video
snippets to predict the actual snippet orders. Experimental results demonstrate
the superiority of our TCGL over the state-of-the-art methods on large-scale
action recognition and video retrieval benchmarks.
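As a rough illustration of the two steps described above, the following is a minimal PyTorch sketch: one function generates a correlated view of a snippet graph by randomly removing edges and masking node features, and an InfoNCE-style loss maximizes agreement between the same node's embeddings in two views. The function names, drop/mask rates, and temperature are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def augment_graph(x, edge_index, edge_drop=0.2, node_mask=0.2):
    # Randomly keep a subset of edges (edge_index: [2, E]).
    keep = torch.rand(edge_index.size(1)) > edge_drop
    edge_index = edge_index[:, keep]
    # Zero out the features of randomly chosen nodes (x: [N, D]).
    x = x.clone()
    x[torch.rand(x.size(0)) < node_mask] = 0.0
    return x, edge_index

def node_contrastive_loss(z1, z2, tau=0.5):
    # z1, z2: [N, D] embeddings of the same N nodes in two graph views.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t() / tau                # pairwise cosine similarities
    targets = torch.arange(z1.size(0))     # positives sit on the diagonal
    return F.cross_entropy(sim, targets)
```

In practice the two augmented views would be encoded by a shared graph network before the loss is applied.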
Related papers
- Temporal Graph Representation Learning with Adaptive Augmentation Contrastive [12.18909612212823]
Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information.
We propose a novel Temporal Graph representation learning with Adaptive augmentation Contrastive (TGAC) model.
Our experiments on various real networks demonstrate that the proposed model outperforms other temporal graph representation learning methods.
arXiv Detail & Related papers (2023-11-07T11:21:16Z)
- Time-aware Graph Structure Learning via Sequence Prediction on Temporal Graphs [10.034072706245544]
We propose a Time-aware Graph Structure Learning (TGSL) approach via sequence prediction on temporal graphs.
In particular, it predicts a time-aware context embedding and uses the Gumbel-Top-K trick to select the candidate edges closest to this context embedding (the trick is sketched below).
Experiments on temporal link prediction benchmarks demonstrate that TGSL yields significant gains for popular temporal graph networks (TGNs) such as TGAT and GraphMixer.
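The Gumbel-Top-K trick itself is standard: adding i.i.d. Gumbel noise to scores and keeping the top k samples k items without replacement in proportion to softmax(scores / tau). A minimal sketch, assuming PyTorch (the names and temperature are illustrative, not from the paper):

```python
import torch

def gumbel_top_k(scores, k, tau=1.0):
    # scores: one logit per candidate edge.
    gumbel = -torch.log(-torch.log(torch.rand_like(scores)))
    return torch.topk(scores / tau + gumbel, k).indices

# e.g. sample 5 of 100 candidate edges scored against a context embedding
edge_ids = gumbel_top_k(torch.randn(100), k=5)
```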
arXiv Detail & Related papers (2023-06-13T11:34:36Z)
- Deep Temporal Graph Clustering [77.02070768950145]
We propose a general framework for deep Temporal Graph Clustering (TGC).
TGC introduces deep clustering techniques to suit the interaction sequence-based batch-processing pattern of temporal graphs.
Our framework can effectively improve the performance of existing temporal graph learning methods.
arXiv Detail & Related papers (2023-05-18T06:17:50Z)
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations with LTN consistently improves the performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- TodyNet: Temporal Dynamic Graph Neural Network for Multivariate Time Series Classification [6.76723360505692]
We propose a novel temporal dynamic graph neural network (TodyNet) that can extract hidden temporal dependencies without relying on a predefined graph structure (one common way to learn such a graph is sketched below).
Experiments on 26 UEA benchmark datasets illustrate that the proposed TodyNet outperforms existing deep learning-based methods on multivariate time series classification (MTSC) tasks.
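The summary does not say how the graph itself is obtained; one common construction (popularized by Graph WaveNet and offered here only as a stand-in sketch, not necessarily TodyNet's mechanism) learns a dense adjacency over the series' channels from trainable node embeddings:

```python
import torch
import torch.nn as nn

class LearnedGraph(nn.Module):
    # Learns a graph over the channels of a multivariate time series.
    def __init__(self, num_nodes, dim=16):
        super().__init__()
        self.src = nn.Parameter(torch.randn(num_nodes, dim))
        self.dst = nn.Parameter(torch.randn(num_nodes, dim))

    def forward(self):
        # Row-normalized adjacency, trained end-to-end with the classifier.
        return torch.softmax(torch.relu(self.src @ self.dst.t()), dim=1)

adj = LearnedGraph(num_nodes=12)()  # e.g. 12 sensor channels
```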
arXiv Detail & Related papers (2023-04-11T09:21:28Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module (a generic order-prediction head is sketched below).
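A minimal sketch of the generic snippet-order-prediction head such a module builds on, assuming PyTorch: shuffled snippet features are concatenated and classified into one of the n! possible orderings. The class name, feature dimension, and snippet count are illustrative, and ASOP's adaptive channel recalibration is not shown.

```python
import torch
import torch.nn as nn
from math import factorial

class OrderPredictionHead(nn.Module):
    def __init__(self, feat_dim=512, n_snippets=3):
        super().__init__()
        # One class per possible ordering of the shuffled snippets.
        self.fc = nn.Linear(feat_dim * n_snippets, factorial(n_snippets))

    def forward(self, snippet_feats):              # [B, n_snippets, feat_dim]
        return self.fc(snippet_feats.flatten(1))   # logits over orderings
```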
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better discriminate between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames and are linked by two types of edges (a construction along these lines is sketched below).
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
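As a rough illustration of such a graph, the sketch below assigns node ids to grid patches and builds the two edge types: spatial edges between neighboring patches within a frame, and temporal edges from each patch to the patches of the next frame. The grid size, frame count, and full inter-frame connectivity are illustrative assumptions, not the paper's exact design.

```python
def build_space_time_graph(n_frames=4, grid=4):
    def node(t, r, c):  # flat node id for frame t, grid cell (r, c)
        return t * grid * grid + r * grid + c
    spatial, temporal = [], []
    for t in range(n_frames):
        for r in range(grid):
            for c in range(grid):
                if c + 1 < grid:      # right neighbor within the frame
                    spatial.append((node(t, r, c), node(t, r, c + 1)))
                if r + 1 < grid:      # bottom neighbor within the frame
                    spatial.append((node(t, r, c), node(t, r + 1, c)))
                if t + 1 < n_frames:  # link to every patch of the next frame
                    temporal += [(node(t, r, c), node(t + 1, r2, c2))
                                 for r2 in range(grid) for c2 in range(grid)]
    return spatial, temporal
```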
arXiv Detail & Related papers (2021-09-28T05:40:01Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
The proposed CTL framework utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)