STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention
Transformer for Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2312.03288v1
- Date: Wed, 6 Dec 2023 04:36:58 GMT
- Title: STEP CATFormer: Spatial-Temporal Effective Body-Part Cross Attention
Transformer for Skeleton-based Action Recognition
- Authors: Nguyen Huu Bao Long
- Abstract summary: We focus on how graph convolutional networks learn different topologies and effectively aggregate joint features over global and local temporal ranges.
We propose three Channel-wise Topology Graph Convolutions based on Channel-wise Topology Refinement Graph Convolution (CTR-GCN).
We develop a powerful graph convolutional network named Spatial Temporal Effective Body-part Cross Attention Transformer, which achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Graph convolutional networks (GCNs) have been widely used and achieved
remarkable results in skeleton-based action recognition. We think the key to
skeleton-based action recognition is a skeleton changing in frames, so we focus
on how graph convolutional networks learn different topologies
and effectively aggregate joint features over global and local temporal
ranges. In this work, we propose three Channel-wise Topology Graph
Convolutions based on Channel-wise Topology Refinement Graph Convolution
(CTR-GCN). Combining CTR-GCN with two joint cross-attention modules captures
skeleton features of the upper-lower body-part and hand-foot relationships.
Then, to capture how human skeletons change across frames, we design
Temporal Attention Transformers, which learn the temporal features of human
skeleton sequences. Finally, we fuse the temporal feature output scales with
an MLP for classification. We develop a powerful graph convolutional network
named Spatial Temporal Effective Body-part Cross Attention Transformer, which
achieves notably high performance on the NTU RGB+D and NTU RGB+D 120 datasets.
Our code and models are available at https://github.com/maclong01/STEP-CATFormer
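The body-part cross attention mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with no learned projections, where joints of one body part act as queries and joints of the other act as keys and values. The function names, the joint grouping, and the toy feature values are hypothetical and are not taken from the STEP-CATFormer code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross attention without learned projections.

    queries:     joint feature vectors of one body part (e.g. upper body)
    keys_values: joint feature vectors of the other body part (e.g. lower body)
    Returns one attended vector per query joint.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        # Similarity of this query joint to every joint of the other part.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Weighted average of the other part's joint features.
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended


# Hypothetical toy input: 2 upper-body joints, 3 lower-body joints, 4 channels.
upper = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0]]
lower = [[0.9, 0.1, 0.4, 0.3], [0.0, 1.0, 0.0, 0.1], [0.2, 0.2, 0.2, 0.2]]
upper_ctx = cross_attention(upper, lower)
```

Each row of `upper_ctx` is a convex combination of the lower-body joint features, so an upper-body joint is described in terms of the lower-body joints it most resembles. A trained model would normally add learned query, key, and value projections and multiple heads, which this sketch omits.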
Related papers
- Signal-SGN: A Spiking Graph Convolutional Network for Skeletal Action Recognition via Learning Temporal-Frequency Dynamics [2.9578022754506605]
In skeletal-based action recognition, Graph Convolutional Networks (GCNs) face limitations due to their complexity and high energy consumption.
We propose Signal-SGN (Spiking Graph Convolutional Network), which leverages the temporal dimension of skeletal sequences as the spiking timestep.
Our experiments show that the proposed models not only surpass existing SNN-based methods in accuracy but also reduce computational storage costs during training.
arXiv Detail & Related papers (2024-08-03T07:47:16Z)
- SkeleTR: Towards Skeleton-based Action Recognition in the Wild [86.03082891242698]
SkeleTR is a new framework for skeleton-based action recognition.
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions.
It then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
arXiv Detail & Related papers (2023-09-20T16:22:33Z)
- Pose-Guided Graph Convolutional Networks for Skeleton-Based Action
Recognition [32.07659338674024]
Graph convolutional networks (GCNs) can model the human body skeletons as spatial and temporal graphs.
In this work, we propose pose-guided GCN (PG-GCN), a multi-modal framework for high-performance human action recognition.
The core idea of this module is to utilize a trainable graph to aggregate features from the skeleton stream with that of the pose stream, which leads to a network with more robust feature representation ability.
arXiv Detail & Related papers (2022-10-10T02:08:49Z)
- SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors.
Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action-associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GC can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Action Recognition with Domain Invariant Features of Skeleton Image [25.519217340328442]
We propose a novel CNN-based method with adversarial training for action recognition.
We introduce a two-level domain adversarial learning to align the features of skeleton images from different view angles or subjects.
It achieves competitive results compared with state-of-the-art methods.
arXiv Detail & Related papers (2021-11-19T08:05:54Z)
- HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based
Gesture Recognition [73.64451471862613]
We propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition.
The joint self-attention module captures spatial features of fingers, and the finger self-attention module aggregates features of the whole hand.
Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.
arXiv Detail & Related papers (2021-06-25T02:15:53Z)
- Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Spatio-Temporal Inception Graph Convolutional Networks for
Skeleton-Based Action Recognition [126.51241919472356]
We design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition.
Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths.
arXiv Detail & Related papers (2020-11-26T14:43:04Z)
- Skeleton-based Action Recognition via Spatial and Temporal Transformer
Networks [12.06555892772049]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator.
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joint coordinates as input, and results on par with the state of the art when adding bone information.
arXiv Detail & Related papers (2020-08-17T15:25:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.