SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised
Skeleton Action Recognition
- URL: http://arxiv.org/abs/2209.02399v2
- Date: Wed, 10 May 2023 02:22:07 GMT
- Title: SkeletonMAE: Spatial-Temporal Masked Autoencoders for Self-supervised
Skeleton Action Recognition
- Authors: Wenhan Wu, Yilei Hua, Ce Zheng, Shiqian Wu, Chen Chen, Aidong Lu
- Abstract summary: Self-supervised skeleton-based action recognition has attracted more attention.
By utilizing unlabeled data, more generalizable features can be learned to alleviate the overfitting problem.
We propose a spatial-temporal masked autoencoder framework for self-supervised 3D skeleton-based action recognition.
- Score: 13.283178393519234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fully supervised skeleton-based action recognition has achieved great
progress with the blooming of deep learning techniques. However, these methods
require sufficient labeled data, which is not easy to obtain. In contrast,
self-supervised skeleton-based action recognition has attracted more attention.
By utilizing unlabeled data, more generalizable features can be learned
to alleviate the overfitting problem and reduce the demand for massive labeled
training data. Inspired by the MAE, we propose a spatial-temporal masked
autoencoder framework for self-supervised 3D skeleton-based action recognition
(SkeletonMAE). Following MAE's masking and reconstruction pipeline, we utilize
a skeleton-based encoder-decoder transformer architecture to reconstruct the
masked skeleton sequences. A novel masking strategy, named Spatial-Temporal
Masking, is introduced at both the joint level and the frame level of the
skeleton sequence. This pre-training strategy makes the encoder output
generalizable skeleton features with spatial and temporal dependencies. Given
the unmasked skeleton sequence, the encoder is fine-tuned for the action
recognition task. Extensive experiments show that our SkeletonMAE achieves
remarkable performance and outperforms the state-of-the-art methods on both NTU
RGB+D and NTU RGB+D 120 datasets.
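The masking-and-reconstruction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mask ratios, the choice to hide the same joints in every frame, and the choice to score the reconstruction only on masked joints (as the original MAE does for image patches) are all assumptions made for illustration. It uses only the Python standard library.

```python
import random


def spatial_temporal_mask(num_frames, num_joints, frame_ratio, joint_ratio, seed=0):
    """Return a num_frames x num_joints grid of booleans; True marks a hidden joint.

    Hypothetical helper: combines frame-level masking (whole frames hidden)
    with joint-level masking (selected joints hidden in every frame).
    """
    rng = random.Random(seed)
    # Frame-level masking: hide every joint in a random subset of frames.
    masked_frames = set(rng.sample(range(num_frames), round(num_frames * frame_ratio)))
    # Joint-level masking: hide a random subset of joints across all frames.
    masked_joints = set(rng.sample(range(num_joints), round(num_joints * joint_ratio)))
    return [
        [(t in masked_frames) or (j in masked_joints) for j in range(num_joints)]
        for t in range(num_frames)
    ]


def masked_mse(target, prediction, mask):
    """Mean squared error over the coordinates of masked joints only (an assumption)."""
    total, count = 0.0, 0
    for t, row in enumerate(mask):
        for j, hidden in enumerate(row):
            if hidden:
                for a, b in zip(target[t][j], prediction[t][j]):
                    total += (a - b) ** 2
                    count += 1
    return total / count if count else 0.0


# Toy sequence: 8 frames, 25 joints (the NTU RGB+D joint count), 3D coordinates.
mask = spatial_temporal_mask(num_frames=8, num_joints=25, frame_ratio=0.25, joint_ratio=0.4)
target = [[[1.0, 1.0, 1.0] for _ in range(25)] for _ in range(8)]
prediction = [[[0.0, 0.0, 0.0] for _ in range(25)] for _ in range(8)]  # stand-in for decoder output
print(sum(sum(row) for row in mask), "of", 8 * 25, "joints masked")
print("reconstruction loss:", masked_mse(target, prediction, mask))
```

In a real pipeline the encoder would see only the visible joints during pre-training and the unmasked sequence during fine-tuning; here the `prediction` list merely stands in for whatever the decoder would output.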
Related papers
- Spatial Hierarchy and Temporal Attention Guided Cross Masking for Self-supervised Skeleton-based Action Recognition [4.036669828958854]
We introduce a hierarchy and attention guided cross-masking framework (HA-CM) that applies masking to skeleton sequences from both spatial and temporal perspectives.
In spatial graphs, we utilize hyperbolic space to maintain joint distinctions and effectively preserve the hierarchical structure of high-dimensional skeletons.
In temporal flows, we substitute traditional distance metrics with the global attention of joints for masking, addressing the convergence of distances in high-dimensional space and the lack of a global perspective.
arXiv Detail & Related papers (2024-09-26T15:28:25Z)
- ReL-SAR: Representation Learning for Skeleton Action Recognition with Convolutional Transformers and BYOL [6.603505460200282]
Unsupervised representation learning is of prime importance to leverage unlabeled skeleton data.
We design a lightweight convolutional transformer framework, named ReL-SAR, for jointly modeling spatial and temporal cues in skeleton sequences.
We capitalize on Bootstrap Your Own Latent (BYOL) to learn robust representations from unlabeled skeleton sequence data.
arXiv Detail & Related papers (2024-09-09T16:03:26Z)
- Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
- SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training [110.55093254677638]
We propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL).
In this paper, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE.
Our SSL generalizes well across different datasets and outperforms the state-of-the-art self-supervised skeleton-based action recognition methods.
arXiv Detail & Related papers (2023-07-17T13:33:11Z)
- Self-Supervised 3D Action Representation Learning with Skeleton Cloud Colorization [75.0912240667375]
3D Skeleton-based human action recognition has attracted increasing attention in recent years.
Most of the existing work focuses on supervised learning which requires a large number of labeled action sequences.
In this paper, we address self-supervised 3D action representation learning for skeleton-based action recognition.
arXiv Detail & Related papers (2023-04-18T08:03:26Z)
- Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z)
- SimMC: Simple Masked Contrastive Learning of Skeleton Representations for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z)
- Skeleton Cloud Colorization for Unsupervised 3D Action Representation Learning [65.88887113157627]
Skeleton-based human action recognition has attracted increasing attention in recent years.
We design a novel skeleton cloud colorization technique that is capable of learning skeleton representations from unlabeled skeleton sequence data.
We show that the proposed method outperforms existing unsupervised and semi-supervised 3D action recognition methods by large margins.
arXiv Detail & Related papers (2021-08-04T10:55:39Z)
- A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.