Joint Temporal Pooling for Improving Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2408.09356v1
- Date: Sun, 18 Aug 2024 04:40:16 GMT
- Title: Joint Temporal Pooling for Improving Skeleton-based Action Recognition
- Authors: Shanaka Ramesh Gunasekara, Wanqing Li, Jack Yang, Philip Ogunbona
- Abstract summary: In skeleton-based human action recognition, temporal pooling is a critical step for capturing the spatiotemporal relationship of joint dynamics.
This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition.
The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.
- Score: 4.891381363264954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In skeleton-based human action recognition, temporal pooling is a critical step for capturing spatiotemporal relationship of joint dynamics. Conventional pooling methods overlook the preservation of motion information and treat each frame equally. However, in an action sequence, only a few segments of frames carry discriminative information related to the action. This paper presents a novel Joint Motion Adaptive Temporal Pooling (JMAP) method for improving skeleton-based action recognition. Two variants of JMAP, frame-wise pooling and joint-wise pooling, are introduced. The efficacy of JMAP has been validated through experiments on the popular NTU RGB+D 120 and PKU-MMD datasets.
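The contrast the abstract draws between uniform pooling and motion-aware pooling can be illustrated with a small sketch. This is not the JMAP algorithm, whose details are not given on this page; it only shows the general idea of weighting frames by how much they move. The function name `motion_weighted_pool`, the (T, J, C) feature layout, the use of frame-to-frame differences as a motion proxy, and the softmax weighting are all assumptions made for illustration.

```python
import numpy as np

def motion_weighted_pool(x, joint_wise=False):
    """Pool a (T, J, C) feature sequence over time, favoring high-motion frames.

    Hypothetical illustration only; not the method from the paper.
    T = frames, J = joints, C = channels.
    """
    # Motion proxy (assumption): change between consecutive frames.
    # The first frame is compared with itself, so its motion is zero.
    diff = np.diff(x, axis=0, prepend=x[:1])        # (T, J, C)
    mag = np.linalg.norm(diff, axis=-1)             # (T, J)

    if not joint_wise:
        # Frame-wise variant: one weight per frame, shared by all joints.
        mag = mag.sum(axis=1, keepdims=True)        # (T, 1)

    # Softmax over the time axis turns motion magnitudes into weights.
    w = np.exp(mag - mag.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)            # (T, J) or (T, 1)

    # Weighted sum over time gives one feature vector per joint.
    return (w[..., None] * x).sum(axis=0)           # (J, C)


# Usage: a sequence that is still for two frames and then moves.
seq = np.array([0.0, 0.0, 10.0]).reshape(3, 1, 1)
pooled = motion_weighted_pool(seq)
uniform = seq.mean(axis=0)
```

With a static sequence every frame receives the same weight, so the result reduces to ordinary average pooling; when motion is concentrated in a few frames, those frames dominate the pooled feature.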
Related papers
- SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition [25.341177384559174]
We propose a novel approach called Skeletal-Temporal Transformer (SkateFormer)
SkateFormer partitions joints and frames based on different types of skeletal-temporal relation.
It can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner.
arXiv Detail & Related papers (2024-03-14T15:55:53Z) - Self-supervised Action Representation Learning from Partial
Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTURGB+D 60, NTURGB+D 120 and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors.
Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - JOLO-GCN: Mining Joint-Centered Light-Weight Information for
Skeleton-Based Action Recognition [47.47099206295254]
We propose a novel framework for employing human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network.
Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance, while keeping the computational and memory overheads low.
arXiv Detail & Related papers (2020-11-16T08:39:22Z) - Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder.
Our joint selector module re-weights the joint information to select the most discriminative joints for the task.
We show large improvements over the current state-of-the-art joint-based approaches on JHMDB, HMDB, Charades, AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.