Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based
Human Action Recognition
- URL: http://arxiv.org/abs/2312.15144v3
- Date: Thu, 18 Jan 2024 14:10:02 GMT
- Title: Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based
Human Action Recognition
- Authors: Shaojie Zhang, Jianqin Yin, and Yonghao Dang
- Abstract summary: STD-CL is a framework to obtain discriminative and semantically distinct representations from the sequences.
STD-CL achieves solid improvements on NTU60, NTU120, and NW-UCLA benchmarks.
- Score: 10.403751563214113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skeleton-based action recognition is a central task in human-computer
interaction. However, most previous methods suffer from two issues: (i)
semantic ambiguity arising from spatial-temporal information mixture; and (ii)
overlooking the explicit exploitation of the latent data distributions (i.e.,
the intra-class variations and inter-class relations), thereby leading to
sub-optimal solutions of the skeleton encoders. To mitigate this, we propose a
spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain
discriminative and semantically distinct representations from the sequences,
which can be incorporated into various previous skeleton encoders and can be
removed when testing. Specifically, we decouple the global features into
spatial-specific and temporal-specific features to reduce the spatial-temporal
coupling of features. Furthermore, to explicitly exploit the latent data
distributions, we apply contrastive learning to the attentive features, which
models cross-sequence semantic relations by pulling together the features of
positive pairs and pushing apart those of negative pairs. Extensive
experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN,
CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and
NW-UCLA benchmarks. The code will be released soon.
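The two steps in the abstract, decoupling global features into spatial-specific and temporal-specific parts and then contrasting positive against negative pairs, can be illustrated with a minimal NumPy sketch. The plain mean-pooling and the InfoNCE-style loss below are generic stand-ins, not the paper's exact attentive-feature or loss formulation:

```python
import numpy as np

def decouple(features):
    """Split a global feature map of shape (T, J, C) -- frames x joints x
    channels -- into a spatial-specific and a temporal-specific vector.
    Mean-pooling is an assumption here; the paper uses attentive features."""
    spatial = features.mean(axis=0).reshape(-1)   # pool over time: joint-wise summary
    temporal = features.mean(axis=1).reshape(-1)  # pool over joints: frame-wise summary
    return spatial, temporal

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: pull the anchor toward its positive,
    push it away from the negatives (all vectors L2-normalized)."""
    norm = lambda v: v / np.linalg.norm(v)
    anchor = norm(anchor)
    sims = [anchor @ norm(positive)] + [anchor @ norm(n) for n in negatives]
    logits = np.array(sims) / temperature
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive pair is the target class

# Toy usage: two views of the same sequence should give a small loss.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 25, 8))                  # 16 frames, 25 joints, 8 channels
s_anchor, _ = decouple(x)
s_positive, _ = decouple(x + 0.01 * rng.normal(size=x.shape))  # augmented view
negs = [decouple(rng.normal(size=(16, 25, 8)))[0] for _ in range(4)]
loss = info_nce(s_anchor, s_positive, negs)
```

Minimizing this loss over many sequences is what shapes the cross-sequence semantic relations; at test time the loss (and the contrastive branch) is simply dropped, as the abstract notes.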
Related papers
- Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition [64.56321246196859]
We propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework.
We first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information.
We introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes.
arXiv Detail & Related papers (2024-11-18T05:16:11Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space [7.324708513042455]
This paper proposes a novel contrastive learning framework, called FOCAL, for extracting comprehensive features from multimodal time-series sensing signals.
It consistently outperforms the state-of-the-art baselines in downstream tasks with a clear margin.
arXiv Detail & Related papers (2023-10-30T22:55:29Z)
- Exploiting Spatial-temporal Data for Sleep Stage Classification via Hypergraph Learning [16.802013781690402]
We propose a dynamic learning framework STHL, which introduces hypergraph to encode spatial-temporal data for sleep stage classification.
Our proposed STHL outperforms the state-of-the-art models in sleep stage classification tasks.
arXiv Detail & Related papers (2023-09-05T11:01:30Z)
- Linking data separation, visual separation, and classifier performance using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z)
- Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition [12.601122522537459]
We propose a novel Spatiotemporal Decouple-and-Squeeze Contrastive Learning (SDS-CL) framework to learn more abundant representations of skeleton-based actions.
We present a new Spatial-squeezing Loss (SSL), a new Temporal-squeezing Loss (TSL), and a Global-contrasting Loss (GL) to contrast the spatial-squeezing joint and motion features at the frame level, the temporal-squeezing joint and motion features at the joint level, and the global joint and motion features at the skeleton level.
arXiv Detail & Related papers (2023-02-05T06:52:25Z)
- Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition [9.999149887494646]
Skeleton-based action recognition has attracted considerable attention due to its compact representation of the human body's skeletal structure.
Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs).
arXiv Detail & Related papers (2022-12-09T10:37:22Z)
- Learning Appearance-motion Normality for Video Anomaly Detection [11.658792932975652]
We propose a two-stream auto-encoder framework augmented with spatial-temporal memories.
It learns the appearance normality and motion normality independently and explores the correlations via adversarial learning.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 98.1% and 89.8% on UCSD Ped2 and CUHK Avenue datasets.
arXiv Detail & Related papers (2022-07-27T08:30:19Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
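The segment-permutation pretext task described above can be sketched as follows; the segment count and the plain list-of-frames representation are illustrative assumptions, not this paper's exact setup:

```python
import itertools
import random

def make_permutation_sample(sequence, n_segments=3, seed=None):
    """Self-supervised sample: cut a sequence into equal temporal segments,
    shuffle them, and return (shuffled_sequence, permutation_id). A model
    trained to predict the permutation id must learn temporal structure."""
    rng = random.Random(seed)
    seg_len = len(sequence) // n_segments
    segments = [sequence[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]
    perms = list(itertools.permutations(range(n_segments)))
    label = rng.randrange(len(perms))             # class label = applied permutation
    shuffled = [frame for idx in perms[label] for frame in segments[idx]]
    return shuffled, label

frames = list(range(12))                          # stand-in for 12 skeleton frames
shuffled, label = make_permutation_sample(frames, n_segments=3, seed=0)
```

Because the labels come for free from the permutation itself, the same task can be run on both the source and target datasets, which is what lets it reduce domain shift without target annotations.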
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.