Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences
- URL: http://arxiv.org/abs/2302.09018v1
- Date: Fri, 17 Feb 2023 17:35:05 GMT
- Title: Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences
- Authors: Yujie Zhou, Haodong Duan, Anyi Rao, Bing Su, Jiaqi Wang
- Abstract summary: We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD under various downstream tasks.
- Score: 29.376328807860993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has demonstrated remarkable capability in
representation learning for skeleton-based action recognition. Existing methods
mainly focus on applying global data augmentation to generate different views
of the skeleton sequence for contrastive learning. However, due to the rich
action clues in the skeleton sequences, existing methods may only take a global
perspective to learn to discriminate different skeletons without thoroughly
leveraging the local relationship between different skeleton joints and video
frames, which is essential for real-world applications. In this work, we
propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the
local relationship from partial skeleton sequences built by a unique
spatio-temporal masking strategy. Specifically, we construct a
negative-sample-free triplet stream structure that is composed of an anchor
stream without any masking, a spatial masking stream with Central Spatial
Masking (CSM), and a temporal masking stream with Motion Attention Temporal
Masking (MATM). The feature cross-correlation matrix is measured between the
anchor stream and the other two masking streams, respectively. (1) Central
Spatial Masking discards selected joints from the feature calculation process,
where joints with a higher degree of centrality have a higher probability of
being selected. (2) Motion Attention Temporal Masking leverages the motion of
the action and removes faster-moving frames with a higher probability. Our
method achieves state-of-the-art performance on NTURGB+D 60, NTURGB+D 120 and
PKU-MMD under various downstream tasks. Furthermore, a practical evaluation is
performed where some skeleton joints are lost in downstream tasks. In contrast
to previous methods that suffer from large performance drops, our PSTL can
still achieve remarkable results under this challenging setting, validating the
robustness of our method.
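As a concrete illustration, the two masking strategies and the negative-sample-free objective described in the abstract can be sketched in a few lines of NumPy. Everything below (the degree values, stream shapes, and feature computation) is an illustrative assumption, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton sequence: T frames, V joints, 3D coordinates.
T, V = 16, 25
seq = rng.normal(size=(T, V, 3))

# Central Spatial Masking (CSM): joints with higher degree centrality are
# masked with higher probability. Degree values here are stand-ins for the
# skeleton-graph degrees, purely for illustration.
degree = rng.integers(1, 5, size=V).astype(float)
p_joint = degree / degree.sum()              # masking probability per joint
masked_joints = rng.choice(V, size=5, replace=False, p=p_joint)

# Motion Attention Temporal Masking (MATM): frames with larger frame-to-frame
# motion are dropped with higher probability.
motion = np.linalg.norm(np.diff(seq, axis=0), axis=(1, 2))  # length T-1
motion = np.concatenate([[motion[0]], motion])              # pad to length T
p_frame = motion / motion.sum()
masked_frames = rng.choice(T, size=4, replace=False, p=p_frame)

masked = seq.copy()
masked[:, masked_joints] = 0.0   # spatial-masking stream input
masked[masked_frames] = 0.0      # temporal-masking stream input (a separate
                                 # stream in PSTL; combined here for brevity)

# Negative-sample-free objective: the cross-correlation matrix between
# anchor-stream and masked-stream embeddings, which the loss pushes toward
# the identity (Barlow Twins style).
def cross_correlation(za, zb):
    za = (za - za.mean(0)) / (za.std(0) + 1e-8)
    zb = (zb - zb.mean(0)) / (zb.std(0) + 1e-8)
    return za.T @ zb / za.shape[0]

feat_a = rng.normal(size=(32, 8))   # anchor-stream embeddings (batch x dim)
feat_b = rng.normal(size=(32, 8))   # masking-stream embeddings
C = cross_correlation(feat_a, feat_b)
print(C.shape)  # (8, 8)
```

In the paper this matrix is computed twice, once per masking stream against the shared anchor; the sketch shows a single pair.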
Related papers
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized
Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z) - Unveiling the Hidden Realm: Self-supervised Skeleton-based Action
Recognition in Occluded Environments [41.664437160034176]
We propose a simple and effective method to empower robots with the capacity to address occlusion.
We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples.
We then employ K-nearest-neighbor (KNN) to fill in missing skeleton data based on the closest sample neighbors.
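The cluster-then-impute recipe above can be sketched as nearest-neighbor filling by embedding distance; this toy NumPy version (synthetic embeddings and data, with NaNs marking occluded joints) is a sketch of the idea, not the paper's pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic sequence embeddings for N samples, plus flattened skeleton data
# where NaN marks an occluded (missing) joint value.
N, D, J = 8, 4, 6
emb = rng.normal(size=(N, D))
skel = rng.normal(size=(N, J))
skel[0, [1, 3]] = np.nan          # sample 0 has two occluded joints

def knn_impute(skel, emb, k=3):
    """Fill NaN joints with the mean of the k nearest neighbors
    (by embedding distance) that have that joint observed."""
    out = skel.copy()
    for i in range(len(skel)):
        miss = np.isnan(skel[i])
        if not miss.any():
            continue
        dist = np.linalg.norm(emb - emb[i], axis=1)
        order = np.argsort(dist)[1:]          # exclude the sample itself
        for j in np.flatnonzero(miss):
            vals = [skel[n, j] for n in order if not np.isnan(skel[n, j])][:k]
            out[i, j] = np.mean(vals)
    return out

filled = knn_impute(skel, emb)
print(np.isnan(filled).any())  # False
```

The paper additionally groups samples by k-means before the KNN step; here the neighbor search runs over all samples for brevity.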
arXiv Detail & Related papers (2023-09-21T12:51:11Z) - Zero-shot Skeleton-based Action Recognition via Mutual Information
Estimation and Maximization [26.721082316870532]
Zero-shot skeleton-based action recognition aims to recognize actions of unseen categories after training on data of seen categories.
We propose a new zero-shot skeleton-based action recognition method via mutual information (MI) estimation and maximization.
arXiv Detail & Related papers (2023-08-07T23:41:55Z) - One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton
Matching [77.6989219290789]
One-shot skeleton action recognition aims to learn a skeleton action recognition model with a single training sample.
This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching.
arXiv Detail & Related papers (2023-07-14T11:52:10Z) - SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised
Skeleton Action Recognition [13.283178393519234]
Self-supervised skeleton-based action recognition has attracted increasing attention.
By utilizing unlabeled data, more generalizable features can be learned to alleviate the overfitting problem.
We propose a spatial-temporal masked autoencoder framework for self-supervised 3D skeleton-based action recognition.
arXiv Detail & Related papers (2022-09-01T20:54:27Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
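A segment-permutation pretext task of this kind is easy to sketch: cut a sequence into temporal segments, shuffle them, and use the permutation index as a free classification label. The following toy NumPy version is a sketch under assumed shapes, not the paper's implementation:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

# All 3! = 6 possible orderings of 3 temporal segments; the index of the
# applied permutation serves as the self-supervised classification target.
perms = list(itertools.permutations(range(3)))

def permute_segments(seq, perm):
    """Reorder the temporal segments of `seq` according to `perm`."""
    segments = np.array_split(seq, 3, axis=0)
    return np.concatenate([segments[p] for p in perm], axis=0)

seq = np.arange(12).reshape(12, 1)      # stand-in skeleton sequence (T=12)
label = int(rng.integers(len(perms)))   # self-supervised target
shuffled = permute_segments(seq, perms[label])
print(shuffled.shape)  # (12, 1)
```

The paper applies the same idea along a second axis by permuting human body parts, giving two classification tasks from the same unlabeled data.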
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - SimMC: Simple Masked Contrastive Learning of Skeleton Representations
for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two different spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Sequential convolutional network for behavioral pattern extraction in
gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
arXiv Detail & Related papers (2021-04-23T08:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.