SCD-Net: Spatiotemporal Clues Disentanglement Network for
Self-supervised Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2309.05834v1
- Date: Mon, 11 Sep 2023 21:32:13 GMT
- Title: SCD-Net: Spatiotemporal Clues Disentanglement Network for
Self-supervised Skeleton-based Action Recognition
- Authors: Cong Wu, Xiao-Jun Wu, Josef Kittler, Tianyang Xu, Sara Atito, Muhammad
Awais, Zhenhua Feng
- Abstract summary: This paper introduces a contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, we integrate the decoupling module with a feature extractor to derive explicit clues from the spatial and temporal domains respectively.
We conduct evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, transfer learning, and semi-supervised learning.
- Score: 39.99711066167837
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Contrastive learning has achieved great success in skeleton-based action
recognition. However, most existing approaches encode the skeleton sequences as
entangled spatiotemporal representations and confine the contrasts to the same
level of representation. Instead, this paper introduces a novel contrastive
learning framework, namely Spatiotemporal Clues Disentanglement Network
(SCD-Net). Specifically, we integrate the decoupling module with a feature
extractor to derive explicit clues from spatial and temporal domains
respectively. As for the training of SCD-Net, with a constructed global anchor,
we encourage the interaction between the anchor and extracted clues. Further,
we propose a new masking strategy with structural constraints to strengthen the
contextual associations, leveraging the latest development from masked image
modelling into the proposed SCD-Net. We conduct extensive evaluations on the
NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream
tasks such as action recognition, action retrieval, transfer learning, and
semi-supervised learning. The experimental results demonstrate the
effectiveness of our method, which outperforms the existing state-of-the-art
(SOTA) approaches significantly.
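To make the recipe concrete, below is a minimal PyTorch sketch of the three ingredients the abstract names: decoupled spatial and temporal clues, contrastive interaction with a global anchor, and part-wise structural masking. Every name, shape, and pooling choice here is our own assumption for illustration; it is not SCD-Net's actual architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiotemporalDecoupler(nn.Module):
    """Toy stand-in for a decoupling module: from a shared feature map
    (batch, channels, frames, joints), pool over frames to keep a
    joint-wise "spatial clue" and over joints to keep a frame-wise
    "temporal clue"."""
    def __init__(self, channels, dim):
        super().__init__()
        self.spatial_head = nn.Linear(channels, dim)
        self.temporal_head = nn.Linear(channels, dim)

    def forward(self, x):
        s = x.mean(dim=2).transpose(1, 2)   # (batch, joints, channels)
        t = x.mean(dim=3).transpose(1, 2)   # (batch, frames, channels)
        spatial = self.spatial_head(s).mean(dim=1)    # (batch, dim)
        temporal = self.temporal_head(t).mean(dim=1)  # (batch, dim)
        return spatial, temporal

def info_nce(anchor, clue, temperature=0.1):
    """Standard InfoNCE: the clue from the same sequence is the
    positive; the other sequences in the batch are negatives."""
    a = F.normalize(anchor, dim=-1)
    c = F.normalize(clue, dim=-1)
    logits = a @ c.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def structural_mask(x, joint_groups, p=0.3):
    """Crude "structural" masking: drop whole body-part groups of
    joints rather than random entries, so the model must rely on
    contextual associations across parts."""
    out = x.clone()
    for group in joint_groups:
        if torch.rand(()) < p:
            out[..., group] = 0.0
    return out

# Toy usage: 8 sequences, 64 channels, 50 frames, 25 joints.
feats = torch.randn(8, 64, 50, 25)
arms_and_legs = [[4, 5, 6, 7], [8, 9, 10, 11]]   # made-up joint groups
feats = structural_mask(feats, arms_and_legs)
decoupler = SpatiotemporalDecoupler(channels=64, dim=128)
spatial, temporal = decoupler(feats)
anchor = F.normalize(spatial + temporal, dim=-1).detach()  # crude global anchor
loss = info_nce(anchor, spatial) + info_nce(anchor, temporal)
```
The anchor here is just a detached fusion of the two clues; the point of the sketch is only that each clue is contrasted against a shared anchor rather than against the other clue at the same representation level.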
Related papers
- Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images [16.0258685984844]
Continual learning (CL) breaks away from the one-off training paradigm and enables a model to adapt to new data, semantics and tasks continuously.
We propose a unified continual learning model that leverages multi-task joint learning covering pixel-level classification, instance-level segmentation and image-level perception.
arXiv Detail & Related papers (2024-07-19T12:22:32Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses the ambiguity in WSTAL from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z)
- Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD across various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
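The segment-permute-classify pretext task summarised in the entry above fits in a few lines. A hedged sketch, in which the segment count, shapes and names are illustrative assumptions rather than the paper's settings:
```python
import itertools
import torch

# All orderings of 3 temporal segments; the index of the applied
# permutation is the pseudo-label of a 6-way classification task.
PERMS = list(itertools.permutations(range(3)))

def temporal_cubism_sample(x):
    """x: (channels, frames, joints). Split into 3 temporal segments,
    shuffle them by a random permutation, and return the shuffled
    sequence plus the permutation index as its pseudo-label."""
    segments = x.chunk(3, dim=1)
    label = torch.randint(len(PERMS), (1,)).item()
    shuffled = torch.cat([segments[i] for i in PERMS[label]], dim=1)
    return shuffled, label
```
The analogous spatial task would permute groups of body-part joints along the joints dimension instead of temporal segments.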
- SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by several factors.
Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture action-associated relational information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
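The pose-prediction auto-encoder above can be illustrated with a rough sketch; a plain GRU stands in for the CD-JBF-GCN encoder, and all sizes and names are assumptions:
```python
import torch
import torch.nn as nn

class PosePredictionAutoencoder(nn.Module):
    """An encoder summarises the observed poses; a small decoder
    regresses the held-out future frames, so no labels are needed."""
    def __init__(self, joints=25, dims=3, hidden=256, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.GRU(joints * dims, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, horizon * joints * dims)

    def forward(self, poses):
        # poses: (batch, frames, joints * dims)
        _, h = self.encoder(poses)
        return self.decoder(h[-1])  # predict the next `horizon` frames

model = PosePredictionAutoencoder()
poses = torch.randn(4, 40, 25 * 3)           # unlabeled skeleton sequences
pred = model(poses[:, :35])                  # observe the first 35 frames
target = poses[:, 35:40].flatten(1)          # supervise with the last 5
loss = nn.functional.mse_loss(pred, target)  # self-supervised objective
```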
- Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition [18.667198945509114]
We propose a novel Contrast-Reconstruction Representation Learning network (CRRL).
It simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition.
Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method.
arXiv Detail & Related papers (2021-11-22T08:45:34Z)
- Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition [141.24314054768922]
We propose a spatial-temporal multi-cue (STMC) network to solve the vision-based sequence learning problem.
To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks.
arXiv Detail & Related papers (2020-02-08T15:38:44Z)