Spatial Temporal Transformer Network for Skeleton-based Action
Recognition
- URL: http://arxiv.org/abs/2012.06399v1
- Date: Fri, 11 Dec 2020 14:58:21 GMT
- Title: Spatial Temporal Transformer Network for Skeleton-based Action
Recognition
- Authors: Chiara Plizzari, Marco Cannici, Matteo Matteucci
- Abstract summary: We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints.
In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations.
The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data.
- Score: 12.117737635879037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skeleton-based human action recognition has attracted great interest in
recent years, as skeleton data has been demonstrated to be robust to
illumination changes, body scales, dynamic camera views, and complex
backgrounds. Nevertheless, an effective encoding of the latent information
underlying the 3D skeleton is still an open problem. In this work, we propose a
novel Spatial-Temporal Transformer network (ST-TR) which models dependencies
between joints using the Transformer self-attention operator. In our ST-TR
model, a Spatial Self-Attention module (SSA) is used to understand intra-frame
interactions between different body parts, and a Temporal Self-Attention module
(TSA) to model inter-frame correlations. The two are combined in a two-stream
network which outperforms state-of-the-art models using the same input data on
both NTU-RGB+D 60 and NTU-RGB+D 120.
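To make the two-stream idea concrete, the sketch below shows how spatial self-attention over joints (SSA), temporal self-attention over frames (TSA), and late score fusion could be wired together. It is a minimal PyTorch illustration written for this summary, not the authors' released implementation: the tensor layout (batch, frames, joints, channels), feature dimension, head count, and the plain nn.MultiheadAttention blocks are all assumptions, and the actual ST-TR combines these attention modules with graph-convolutional feature extraction.

```python
# Minimal illustrative sketch (assumes PyTorch >= 1.9 for batch_first attention).
# Not the authors' code: dimensions, pooling, and fusion scheme are placeholders.
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """SSA: attention across joints within each frame (intra-frame interactions)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, V, C)
        B, T, V, C = x.shape
        x = x.reshape(B * T, V, C)                      # joints form the attention sequence
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, V, C)


class TemporalSelfAttention(nn.Module):
    """TSA: attention across frames for each joint (inter-frame correlations)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, V, C)
        B, T, V, C = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B * V, T, C)  # frames form the attention sequence
        out, _ = self.attn(x, x, x)
        return out.reshape(B, V, T, C).permute(0, 2, 1, 3)


class TwoStreamSTTR(nn.Module):
    """Two streams (spatial / temporal) whose class scores are fused at the end."""
    def __init__(self, dim=64, num_classes=60):
        super().__init__()
        self.ssa = SpatialSelfAttention(dim)
        self.tsa = TemporalSelfAttention(dim)
        self.fc_s = nn.Linear(dim, num_classes)
        self.fc_t = nn.Linear(dim, num_classes)

    def forward(self, x):                               # x: (B, T, V, C)
        s = self.ssa(x).mean(dim=(1, 2))                # global average pool over T, V
        t = self.tsa(x).mean(dim=(1, 2))
        return self.fc_s(s) + self.fc_t(t)              # late score fusion


# Example: batch of 2 clips, 30 frames, 25 joints, 64-dim joint features.
logits = TwoStreamSTTR()(torch.randn(2, 30, 25, 64))
print(logits.shape)                                     # torch.Size([2, 60])
```

The only difference between the two modules is which axis is flattened into the attention sequence: joints within a frame for SSA, frames per joint for TSA; the 60 output classes simply mirror NTU-RGB+D 60.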
Related papers
- A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human
Interaction Recognition [6.490564374810672]
We propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNNs and models global dependencies through the Transformer.
We show that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2023-12-31T06:46:46Z) - UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z) - STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape
Estimation [53.04781510348416]
Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - Temporal Transformer Networks with Self-Supervision for Action
Recognition [13.00827959393591]
We introduce a Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
The proposed TTSN is promising, achieving state-of-the-art performance for action recognition.
arXiv Detail & Related papers (2021-12-14T12:53:53Z) - Multi-Scale Semantics-Guided Neural Networks for Efficient
Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network (MS-SGN) is proposed for skeleton-based action recognition.
MS-SGN achieves state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z) - IIP-Transformer: Intra-Inter-Part Transformer for Skeleton-Based Action
Recognition [0.5953569982292298]
We propose a novel Transformer-based network (IIP-Transformer) for skeleton-based action recognition tasks.
Instead of exploiting interactions among individual joints, our IIP-Transformer incorporates body joints and parts interactions simultaneously.
The proposed IIP-Transformer achieves state-of-the-art performance with more than 8x less computational complexity than DSTA-Net.
arXiv Detail & Related papers (2021-10-26T03:24:22Z) - Skeleton-based Action Recognition via Spatial and Temporal Transformer
Networks [12.06555892772049]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator.
The proposed ST-TR achieves state-of-the-art performance on all datasets when using joint coordinates as input, and results on par with the state of the art when bone information is added.
arXiv Detail & Related papers (2020-08-17T15:25:40Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The Temporal Adaptive Module (TAM) generates video-specific temporal kernels based on its own feature map.
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.