Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks
- URL: http://arxiv.org/abs/2008.07404v4
- Date: Tue, 22 Jun 2021 15:29:28 GMT
- Title: Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks
- Authors: Chiara Plizzari, Marco Cannici, Matteo Matteucci
- Abstract summary: We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator.
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joint coordinates as input, and performs on par with the state of the art when bone information is added.
- Score: 12.06555892772049
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Skeleton-based Human Activity Recognition has attracted great interest in recent years, as skeleton data have proven robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) have proven effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, effectively encoding the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving backbone results. Compared with methods that use the same input data, the proposed ST-TR achieves state-of-the-art performance on all datasets when using joint coordinates as input, and performs on par with the state of the art when bone information is added.
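As a concrete illustration of the SSA/TSA idea described above, here is a minimal PyTorch sketch, not the authors' implementation: the feature dimension, joint embedding, pooling, and late score fusion are illustrative assumptions (in the published model each stream additionally combines its attention module with convolutions along the other dimension).

```python
# Minimal sketch of spatial and temporal self-attention over skeletons.
# Tensor layout (batch, frames, joints, channels) and all sizes are
# illustrative assumptions, not the ST-TR authors' code.
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """SSA: attend across joints within each frame (intra-frame)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, v, c = x.shape                 # fold frames into the batch
        x = x.reshape(b * t, v, c)           # sequence axis = joints
        out, _ = self.attn(x, x, x)
        return out.reshape(b, t, v, c)


class TemporalSelfAttention(nn.Module):
    """TSA: attend across frames for each joint (inter-frame)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        b, t, v, c = x.shape                 # fold joints into the batch
        x = x.permute(0, 2, 1, 3).reshape(b * v, t, c)  # sequence axis = frames
        out, _ = self.attn(x, x, x)
        return out.reshape(b, v, t, c).permute(0, 2, 1, 3)


class TwoStreamSTTR(nn.Module):
    """Two streams (spatial / temporal) fused by summing class scores."""

    def __init__(self, dim=64, num_classes=60):
        super().__init__()
        self.embed = nn.Linear(3, dim)       # 3D joint coordinates -> features
        self.ssa = SpatialSelfAttention(dim)
        self.tsa = TemporalSelfAttention(dim)
        self.head_s = nn.Linear(dim, num_classes)
        self.head_t = nn.Linear(dim, num_classes)

    def forward(self, coords):
        # coords: (batch, frames, joints, 3) skeleton sequences
        x = self.embed(coords)
        s = self.ssa(x).mean(dim=(1, 2))     # global average pooling
        t = self.tsa(x).mean(dim=(1, 2))
        return self.head_s(s) + self.head_t(t)  # late score fusion


scores = TwoStreamSTTR()(torch.randn(2, 16, 25, 3))  # -> (2, 60) class scores
```

The "bone information" mentioned above is commonly obtained as coordinate differences between pairs of connected joints and fed as a second input modality; that preprocessing is omitted from the sketch.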
Related papers
- Multi-Scale Spatial-Temporal Self-Attention Graph Convolutional Networks for Skeleton-based Action Recognition [0.0]
In this paper, we propose a self-attention/GCN hybrid model, the Multi-Scale Spatial-Temporal self-attention GCN (MSST-GCN).
We utilize a spatial self-attention module with adaptive topology to understand intra-frame interactions among different body parts, and a temporal self-attention module to examine the correlations of each node across frames.
arXiv Detail & Related papers (2024-04-03T10:25:45Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D are novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only for the source dataset, not for the target dataset, during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks (a sketch of such a permutation pretext task follows this list).
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Joint-bone Fusion Graph Convolutional Network for Semi-supervised Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose-prediction-based auto-encoder in the self-supervised training stage allows the network to learn motion representations from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z)
- Dynamic Hypergraph Convolutional Networks for Skeleton-Based Action Recognition [22.188135882864287]
We propose a novel dynamic hypergraph convolutional network (DHGCN) for skeleton-based action recognition.
DHGCN uses a hypergraph to represent the skeleton structure and effectively exploit the motion information contained in human joints (a toy incidence-matrix sketch follows this list).
arXiv Detail & Related papers (2021-12-20T14:46:14Z)
- Multi-Scale Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network (MS-SGN) is proposed for skeleton-based action recognition.
MS-SGN achieves state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z)
- IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action Recognition [0.5953569982292298]
We propose a novel Transformer-based network (IIP-Transformer) for skeleton-based action recognition tasks.
Instead of exploiting interactions among individual joints, our IIP-Transformer incorporates body joints and parts interactions simultaneously.
The proposed IIP-Transformer achieves state-of-the-art performance with more than 8x lower computational complexity than DSTA-Net.
arXiv Detail & Related papers (2021-10-26T03:24:22Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel Correlation and Topology Learning (CTL) framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlations.
CTL utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Spatial Temporal Transformer Network for Skeleton-based Action Recognition [12.117737635879037]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints.
In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations.
The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data.
arXiv Detail & Related papers (2020-12-11T14:58:21Z)
- Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition [46.836815779215456]
We present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition.
Three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization.
To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition.
arXiv Detail & Related papers (2020-07-07T07:58:56Z)
- MotioNet: 3D Human Motion Reconstruction from Monocular Video with Skeleton Consistency [72.82534577726334]
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video.
Our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used motion representation.
arXiv Detail & Related papers (2020-06-22T08:50:09Z)
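For the "Learning from Temporal Spatial Cubism" entry above, here is a minimal sketch of the kind of permutation pretext task it describes; the segment count, shapes, and function names are illustrative assumptions, not that paper's code.

```python
# Minimal sketch of a temporal permutation-prediction pretext task in the
# spirit of the "Temporal Spatial Cubism" entry; all details are assumptions.
import itertools
import random
import torch

SEGMENTS = 3
PERMS = list(itertools.permutations(range(SEGMENTS)))  # 6 possible orders


def permute_segments(seq):
    """Split a skeleton sequence (frames, joints, 3) into roughly equal
    temporal segments, shuffle them, and return the applied order's label."""
    label = random.randrange(len(PERMS))
    chunks = torch.chunk(seq, SEGMENTS, dim=0)
    shuffled = torch.cat([chunks[i] for i in PERMS[label]], dim=0)
    return shuffled, label


seq = torch.randn(30, 25, 3)          # 30 frames, 25 joints, 3D coordinates
shuffled, label = permute_segments(seq)
```

A classifier trained to recover the permutation label must learn temporal structure without any action annotations, which is the kind of self-supervision signal that paper uses to reduce domain shift between datasets.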
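Similarly, for the "Dynamic Hypergraph Convolutional Networks" entry, a toy sketch of why a hypergraph suits skeletons: a single hyperedge can tie a whole body part together, which a pairwise graph edge cannot. The joint grouping (NTU-style indices) and the normalized incidence-matrix propagation are illustrative assumptions, not DHGCN's actual layers.

```python
# Toy hypergraph propagation over a 25-joint skeleton; the grouping and
# the propagation rule are illustrative assumptions, not DHGCN's code.
import torch

NUM_JOINTS = 25
# Each hyperedge connects a whole body part (several joints at once).
hyperedges = [
    [0, 1, 20, 2, 3],            # trunk and head
    [20, 4, 5, 6, 7, 21, 22],    # left arm and hand
    [20, 8, 9, 10, 11, 23, 24],  # right arm and hand
    [0, 12, 13, 14, 15],         # left leg
    [0, 16, 17, 18, 19],         # right leg
]

# Incidence matrix H: H[v, e] = 1 if joint v belongs to hyperedge e.
H = torch.zeros(NUM_JOINTS, len(hyperedges))
for e, joints in enumerate(hyperedges):
    H[joints, e] = 1.0

# One propagation step: average joint features within each hyperedge,
# then scatter the part-level summary back to its member joints.
x = torch.randn(NUM_JOINTS, 64)                          # per-joint features
edge_mean = (H.t() @ x) / H.sum(dim=0, keepdim=True).t()  # (5, 64)
x_out = H @ edge_mean                                    # (25, 64) updated
```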