Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences
- URL: http://arxiv.org/abs/2302.09018v1
- Date: Fri, 17 Feb 2023 17:35:05 GMT
- Title: Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences
- Authors: Yujie Zhou, Haodong Duan, Anyi Rao, Bing Su, Jiaqi Wang
- Abstract summary: We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTURGB+D 60, NTURGB+D 120 and PKU-MMD under various downstream tasks.
- Score: 29.376328807860993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning has demonstrated remarkable capability in
representation learning for skeleton-based action recognition. Existing methods
mainly focus on applying global data augmentation to generate different views
of the skeleton sequence for contrastive learning. However, although skeleton
sequences contain rich action clues, existing methods tend to take only a
global perspective when learning to discriminate different skeletons, without
thoroughly leveraging the local relationship between skeleton joints and video
frames, which is essential for real-world applications. In this work, we
propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the
local relationship from partial skeleton sequences built by a unique
spatio-temporal masking strategy. Specifically, we construct a
negative-sample-free triplet stream structure that is composed of an anchor
stream without any masking, a spatial masking stream with Central Spatial
Masking (CSM), and a temporal masking stream with Motion Attention Temporal
Masking (MATM). The feature cross-correlation matrix is measured between the
anchor stream and the other two masking streams, respectively. (1) Central
Spatial Masking discards selected joints from the feature calculation process,
where joints with a higher degree of centrality have a higher probability of
being selected. (2) Motion Attention Temporal Masking leverages the motion of
the action and removes faster-moving frames with a higher probability. Our
method achieves state-of-the-art performance on NTURGB+D 60, NTURGB+D 120 and
PKU-MMD under various downstream tasks. Furthermore, a practical evaluation is
performed in which some skeleton joints are lost in downstream tasks. In contrast
to previous methods that suffer from large performance drops, our PSTL can
still achieve remarkable results under this challenging setting, validating the
robustness of our method.
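The two masking schemes are simple enough to sketch. Below is a minimal numpy illustration, assuming a binary skeleton adjacency matrix and raw coordinate sequences are available; all names are hypothetical and not taken from the authors' released code.

```python
import numpy as np

def central_spatial_masking(adj, num_mask, rng):
    """Central Spatial Masking (sketch): sample joints to drop, where joints
    with higher degree centrality get a higher masking probability."""
    degree = adj.sum(axis=1)                    # degree centrality per joint
    prob = degree / degree.sum()
    return rng.choice(len(degree), size=num_mask, replace=False, p=prob)

def motion_attention_temporal_masking(seq, num_mask, rng):
    """Motion Attention Temporal Masking (sketch): sample frames to drop,
    where faster-moving frames get a higher masking probability.
    seq: (T, J, C) array of frames, joints, coordinates."""
    motion = np.abs(np.diff(seq, axis=0)).sum(axis=(1, 2))   # per-frame motion
    motion = np.concatenate([motion[:1], motion]) + 1e-6     # pad first frame
    prob = motion / motion.sum()
    return rng.choice(len(prob), size=num_mask, replace=False, p=prob)
```

Between the anchor stream and each masked stream, the negative-sample-free objective can be formed from the feature cross-correlation matrix, in the style of Barlow Twins; a hedged PyTorch sketch:

```python
import torch

def cross_correlation_loss(z_anchor, z_masked, lam=5e-3):
    """Drive the cross-correlation matrix between the two streams'
    batch-normalized embeddings toward the identity (no negatives needed)."""
    n, d = z_anchor.shape
    a = (z_anchor - z_anchor.mean(0)) / (z_anchor.std(0) + 1e-6)
    b = (z_masked - z_masked.mean(0)) / (z_masked.std(0) + 1e-6)
    c = (a.T @ b) / n                             # (d, d) cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lam * off_diag
```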
Related papers
- Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition [4.036669828958854]
We introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives.
In spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons.
In temporal flows, we substitute traditional distance metrics with the global attention of joints for masking, addressing the convergence of distances in high-dimensional space and the lack of a global perspective.
arXiv Detail & Related papers (2024-09-26T15:28:25Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
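Skeleton2vec's contextualized targets follow the data2vec recipe, in which a teacher network, maintained as an exponential moving average (EMA) of the student, produces the prediction targets. A hedged sketch of the usual EMA update (a common pattern, not necessarily the paper's exact configuration):

```python
import copy
import torch
import torch.nn as nn

student = nn.Linear(256, 256)        # stand-in for a skeleton encoder
teacher = copy.deepcopy(student)     # the teacher starts as a copy

@torch.no_grad()
def ema_update(teacher, student, tau=0.999):
    """The teacher tracks an EMA of the student; its high-level features
    serve as the contextualized prediction targets."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)

ema_update(teacher, student)         # called once per training step
```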
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions [40.322770236718775]
We propose a method to integrate self-supervised skeleton-based action recognition methods into autonomous robotic systems.
We first pre-train using occluded skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings to group semantically similar samples.
Imputing incomplete skeleton sequences to create relatively complete sequences provides significant benefits to existing skeleton-based self-supervised methods.
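The clustering step maps directly onto off-the-shelf tooling; a minimal scikit-learn sketch, assuming precomputed sequence embeddings (array shapes and the cluster count are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical embeddings of pre-trained occluded sequences, shape (N, D).
embeddings = np.random.randn(1000, 256).astype(np.float32)

kmeans = KMeans(n_clusters=60, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)  # groups semantically similar samples
```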
arXiv Detail & Related papers (2023-09-21T12:51:11Z)
- One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton Matching [77.6989219290789]
One-shot skeleton action recognition aims to learn a skeleton action recognition model with a single training sample.
This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching.
arXiv Detail & Related papers (2023-07-14T11:52:10Z)
- SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised Skeleton Action Recognition [13.283178393519234]
Self-supervised skeleton-based action recognition has attracted increasing attention.
By utilizing unlabeled data, more generalizable features can be learned to alleviate overfitting.
We propose a spatial-temporal masked autoencoder framework for self-supervised 3D skeleton-based action recognition.
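In this masked-autoencoder setting, the pretext task is to reconstruct the masked portion of the sequence from the visible portion. A toy sketch of the objective, with an MLP standing in for the real spatial-temporal encoder/decoder (not the authors' architecture):

```python
import torch
import torch.nn as nn

class TinySkeletonMAE(nn.Module):
    """Toy masked autoencoder: reconstruct masked frames from visible ones;
    only masked positions contribute to the loss."""
    def __init__(self, dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                     nn.Linear(256, 256))
        self.decoder = nn.Linear(256, dim)

    def forward(self, x, mask):
        # x: (B, T, dim) flattened joint coordinates; mask: (B, T) bool.
        visible = x * (~mask).unsqueeze(-1)     # zero out masked frames
        recon = self.decoder(self.encoder(visible))
        return ((recon - x) ** 2)[mask].mean()  # loss on masked frames only
```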
arXiv Detail & Related papers (2022-09-01T20:54:27Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on a source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
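The temporal variant of such a pretext task is easy to sketch: split a sequence into segments, shuffle them, and train a classifier to recover the permutation (segment count and names here are illustrative):

```python
import numpy as np

def temporal_permutation_task(seq, num_segments, rng):
    """Split a (T, J, C) sequence into segments and shuffle them; the applied
    permutation (or its index) serves as the classification label."""
    segments = np.array_split(seq, num_segments, axis=0)
    perm = rng.permutation(num_segments)
    shuffled = np.concatenate([segments[i] for i in perm], axis=0)
    return shuffled, perm
```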
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- Sequential convolutional network for behavioral pattern extraction in gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
arXiv Detail & Related papers (2021-04-23T08:44:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.