Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action
Recognition
- URL: http://arxiv.org/abs/2007.03263v1
- Date: Tue, 7 Jul 2020 07:58:56 GMT
- Title: Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action
Recognition
- Authors: Lei Shi, Yifan Zhang, Jian Cheng and Hanqing Lu
- Abstract summary: We present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition.
Three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization.
To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition.
- Score: 46.836815779215456
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Dynamic skeletal data, represented as the 2D/3D coordinates of human joints,
has been widely studied for human action recognition due to its high-level
semantic information and environmental robustness. However, previous methods
heavily rely on designing hand-crafted traversal rules or graph topologies to
draw dependencies between the joints, which are limited in performance and
generalizability. In this work, we present a novel decoupled spatial-temporal
attention network (DSTA-Net) for skeleton-based action recognition. It consists
solely of attention blocks, allowing spatial-temporal dependencies between
joints to be modeled without prior knowledge of their positions or mutual
connections. Specifically, to meet the specific requirements of the
skeletal data, three techniques are proposed for building attention blocks,
namely, spatial-temporal attention decoupling, decoupled position encoding and
spatial global regularization. In addition, from the data aspect, we introduce
a skeletal data decoupling technique to emphasize the specific characteristics
of space/time and different motion scales, resulting in a more comprehensive
understanding of human actions. To test the effectiveness of the proposed
method, extensive experiments are conducted on four challenging datasets for
skeleton-based gesture and action recognition, namely, SHREC, DHG, NTU-60 and
NTU-120, where DSTA-Net achieves state-of-the-art performance on all of them.
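The core decoupling idea from the abstract can be illustrated with a minimal sketch: instead of one attention map over all frame-joint pairs (cost quadratic in T*V), a spatial attention attends across the V joints within each frame, and a temporal attention attends across the T frames for each joint. This is an illustrative toy implementation with randomly initialized projection matrices, not the paper's actual DSTA-Net (which additionally uses decoupled position encoding and spatial global regularization); the shapes and function names here are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over the second-to-last axis of x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_st_attention(x, params_s, params_t):
    """x: (T, V, C) = (frames, joints, channels).

    Spatial attention mixes joints within each frame; temporal attention
    mixes frames for each joint. Decoupling reduces the attention cost
    from O((T*V)^2) to O(T*V^2 + V*T^2).
    """
    xs = attention(x, *params_s)                  # (T, V, C): attend over joints
    xt = attention(xs.swapaxes(0, 1), *params_t)  # (V, T, C): attend over frames
    return xt.swapaxes(0, 1)                      # back to (T, V, C)

rng = np.random.default_rng(0)
T, V, C = 4, 25, 8  # e.g. 25 joints as in the NTU skeleton layout
x = rng.standard_normal((T, V, C))
params_s = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
params_t = [rng.standard_normal((C, C)) * 0.1 for _ in range(3)]
y = decoupled_st_attention(x, params_s, params_t)
print(y.shape)  # (4, 25, 8)
```

Note that the output keeps the input's (T, V, C) layout, so blocks of this form can be stacked, which is what makes a purely attention-based architecture possible without a hand-crafted graph topology.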
Related papers
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - On the spatial attention in Spatio-Temporal Graph Convolutional Networks
for skeleton-based human action recognition [97.14064057840089]
Graph convolutional networks (GCNs) achieve promising performance in skeleton-based human action recognition by modeling a sequence of skeletons as a graph.
Most of the recently proposed spatio-temporal graph-based methods improve performance by learning the graph structure at each layer of the network.
arXiv Detail & Related papers (2020-11-07T19:03:04Z) - Skeleton-based Action Recognition via Spatial and Temporal Transformer
Networks [12.06555892772049]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator.
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joints' coordinates as input, and performs on par with the state of the art when bone information is added.
arXiv Detail & Related papers (2020-08-17T15:25:40Z) - What and Where: Modeling Skeletons from Semantic and Spatial
Perspectives for Action Recognition [46.836815779215456]
We propose to model skeletons from a novel spatial perspective, from which the model takes the spatial location as prior knowledge to group human joints.
From the semantic perspective, we propose a Transformer-like network that is expert in modeling joint correlations.
From the spatial perspective, we transform the skeleton data into the sparse format for efficient feature extraction.
arXiv Detail & Related papers (2020-04-07T10:53:45Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z) - A Graph Attention Spatio-temporal Convolutional Network for 3D Human
Pose Estimation in Video [7.647599484103065]
We improve the learning of constraints on the human skeleton by modeling local and global spatial information via attention mechanisms.
Our approach effectively mitigates depth ambiguity and self-occlusion, generalizes to half upper body estimation, and achieves competitive performance on 2D-to-3D video pose estimation.
arXiv Detail & Related papers (2020-03-11T14:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.