Related papers: Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition

URL: http://arxiv.org/abs/2505.23012v1
Date: Thu, 29 May 2025 02:40:47 GMT
Title: Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition
Authors: Shanaka Ramesh Gunasekara, Wanqing Li, Philip Ogunbona, Jack Yang
Abstract summary: This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify the interaction between the moving and static elements of the skeleton. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints, termed "prime joints". A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints.
Score: 4.891381363264954
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Traditional approaches in unsupervised or self-supervised learning for skeleton-based action classification have concentrated predominantly on the dynamic aspects of skeletal sequences. Yet, the intricate interaction between the moving and static elements of the skeleton presents a rarely tapped discriminative potential for action classification. This paper introduces a novel measurement, referred to as spatial-temporal joint density (STJD), to quantify such interaction. Tracking the evolution of this density throughout an action can effectively identify a subset of discriminative moving and/or static joints termed "prime joints" to steer self-supervised learning. A new contrastive learning strategy named STJD-CL is proposed to align the representation of a skeleton sequence with that of its prime joints while simultaneously contrasting the representations of prime and nonprime joints. In addition, a method called STJD-MP is developed by integrating it with a reconstruction-based framework for more effective learning. Experimental evaluations on the NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD datasets in various downstream tasks demonstrate that the proposed STJD-CL and STJD-MP improved performance, particularly by 3.5 and 3.6 percentage points over the state-of-the-art contrastive methods on the NTU RGB+D 120 dataset using X-sub and X-set evaluations, respectively.

Related papers

MS-CLR: Multi-Skeleton Contrastive Learning for Human Action Recognition [49.91188543847175]
Multi-Skeleton Contrastive Learning (MS-CLR) is a framework that aligns pose representations across multiple skeleton conventions extracted from the same sequence. MS-CLR consistently improves performance over strong single-skeleton contrastive learning baselines. A multi-skeleton ensemble further boosts performance, setting new state-of-the-art results on both datasets.
arXiv Detail & Related papers (2025-08-20T17:58:03Z)
Joint Temporal Pooling for Improving Skeleton-based Action Recognition [4.891381363264954]
In skeleton-based human action recognition, temporal pooling is a critical step for capturing the relationship of joint dynamics. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.
arXiv Detail & Related papers (2024-08-18T04:40:16Z)
Skeleton2vec: A Self-supervised Learning Framework with Contextualized Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance. Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework. Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z)
SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition [39.99711066167837]
This paper introduces a contrastive learning framework, namely Spatiotemporal Clues Disentanglement Network (SCD-Net). Specifically, we integrate the sequences with a feature extractor to derive explicit clues from spatial and temporal domains respectively. We conduct evaluations on the NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream tasks such as action recognition, action retrieval, and transfer learning.
arXiv Detail & Related papers (2023-09-11T21:32:13Z)
Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames. Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition. We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors. Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action-associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z)
Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder. Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream. The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
Contrast-reconstruction Representation Learning for Self-supervised Skeleton-based Action Recognition [18.667198945509114]
We propose a novel Contrast-Reconstruction Representation Learning network (CRRL). It simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method.
arXiv Detail & Related papers (2021-11-22T08:45:34Z)
3D Human Action Representation Learning via Cross-View Consistency Pursuit [52.19199260960558]
We propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR). CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner.
arXiv Detail & Related papers (2021-04-29T16:29:41Z)
JOLO-GCN: Mining Joint-Centered Light-Weight Information for Skeleton-Based Action Recognition [47.47099206295254]
We propose a novel framework for employing human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network. Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance, while keeping the computational and memory overheads low.
arXiv Detail & Related papers (2020-11-16T08:39:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.