SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
- URL: http://arxiv.org/abs/2403.09508v3
- Date: Wed, 17 Jul 2024 07:00:42 GMT
- Title: SkateFormer: Skeletal-Temporal Transformer for Human Action Recognition
- Authors: Jeonghyeok Do, Munchurl Kim,
- Abstract summary: We propose a novel approach called Skeletal-Temporal Transformer (SkateFormer)
SkateFormer partitions joints and frames based on different types of skeletal-temporal relation.
It can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner.
- Score: 25.341177384559174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skeleton-based action recognition, which classifies human actions based on the coordinates of joints and their connectivity within skeleton data, is widely utilized in various scenarios. While Graph Convolutional Networks (GCNs) have been proposed for skeleton data represented as graphs, they suffer from limited receptive fields constrained by joint connectivity. To address this limitation, recent advancements have introduced transformer-based methods. However, capturing correlations between all joints in all frames requires substantial memory resources. To alleviate this, we propose a novel approach called Skeletal-Temporal Transformer (SkateFormer) that partitions joints and frames based on different types of skeletal-temporal relation (Skate-Type) and performs skeletal-temporal self-attention (Skate-MSA) within each partition. We categorize the key skeletal-temporal relations for action recognition into a total of four distinct types. These types combine (i) two skeletal relation types based on physically neighboring and distant joints, and (ii) two temporal relation types based on neighboring and distant frames. Through this partition-specific attention strategy, our SkateFormer can selectively focus on key joints and frames crucial for action recognition in an action-adaptive manner with efficient computation. Extensive experiments on various benchmark datasets validate that our SkateFormer outperforms recent state-of-the-art methods.
Related papers
- SkeleTR: Towrads Skeleton-based Action Recognition in the Wild [86.03082891242698]
SkeleTR is a new framework for skeleton-based action recognition.
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions.
It then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
arXiv Detail & Related papers (2023-09-20T16:22:33Z) - One-Shot Action Recognition via Multi-Scale Spatial-Temporal Skeleton
Matching [77.6989219290789]
One-shot skeleton action recognition aims to learn a skeleton action recognition model with a single training sample.
This paper presents a novel one-shot skeleton action recognition technique that handles skeleton action recognition via multi-scale spatial-temporal feature matching.
arXiv Detail & Related papers (2023-07-14T11:52:10Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GC can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition [8.905895607185135]
Transformer shows great potential to model the correlation of important joints.
Existing Transformer-based methods cannot capture the correlation of different joints between frames.
Atemporal-recognition module is proposed to capture the relationship of different joints in consecutive frames.
arXiv Detail & Related papers (2022-01-08T16:03:01Z) - JOLO-GCN: Mining Joint-Centered Light-Weight Information for
Skeleton-Based Action Recognition [47.47099206295254]
We propose a novel framework for employing human pose skeleton and joint-centered light-weight information jointly in a two-stream graph convolutional network.
Compared to the pure skeleton-based baseline, this hybrid scheme effectively boosts performance, while keeping the computational and memory overheads low.
arXiv Detail & Related papers (2020-11-16T08:39:22Z) - Gesture Recognition from Skeleton Data for Intuitive Human-Machine
Interaction [0.6875312133832077]
We propose an approach for segmentation and classification of dynamic gestures based on a set of handcrafted features.
The method for gesture recognition applies a sliding window, which extracts information from both the spatial and temporal dimensions.
At the end, the recognized gestures are used to interact with a collaborative robot.
arXiv Detail & Related papers (2020-08-26T11:28:50Z) - Skeleton-Aware Networks for Deep Motion Retargeting [83.65593033474384]
We introduce a novel deep learning framework for data-driven motion between skeletons.
Our approach learns how to retarget without requiring any explicit pairing between the motions in the training set.
arXiv Detail & Related papers (2020-05-12T12:51:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.