Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2210.02693v1
- Date: Thu, 6 Oct 2022 05:57:15 GMT
- Title: Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition
- Authors: Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu and Wanqing Li
- Abstract summary: We propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer).
It is equipped with two key components: (1) FG-SFormer, a spatial transformer coupling focal joints with global body parts, and (2) FG-TFormer, a focal and global temporal transformer.
In FG-TFormer, dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints and body parts.
- Score: 34.38874828210301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great progress achieved by transformers in various vision
tasks, they remain underexplored for skeleton-based action recognition, with
only a few attempts to date. Moreover, these methods directly compute pair-wise
global self-attention equally for all joints in both the spatial and temporal
dimensions, undervaluing the effect of discriminative local joints and
short-range temporal dynamics. In this work, we propose a novel Focal and
Global Spatial-Temporal Transformer network (FG-STFormer), which is equipped
with two key components: (1) FG-SFormer, a spatial transformer coupling focal
joints with global body parts. It forces the network to focus on modelling
correlations for both the learned discriminative spatial joints and the human
body parts. Selecting focal joints eliminates the negative effect of
non-informative joints when accumulating correlations. Meanwhile, the
interactions between the focal joints and body parts are incorporated to
enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer, a
focal and global temporal transformer. Dilated temporal convolution is
integrated into the global self-attention mechanism to explicitly capture the
local temporal motion patterns of joints and body parts, which proves vital to
making the temporal transformer work. Extensive experimental results on three
benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show that our FG-STFormer
surpasses all existing transformer-based methods and compares favourably with
state-of-the-art GCN-based methods.
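To make the FG-TFormer idea concrete, the sketch below pairs global temporal self-attention with a parallel dilated temporal convolution branch, as the abstract describes. It is a minimal illustration under assumptions, not the authors' implementation: the tensor layout, head count, kernel size, dilation rate, and additive fusion are all choices made here for brevity.

```python
# Hedged sketch of a "focal and global" temporal block: global self-attention
# over frames plus a dilated temporal convolution for local motion patterns.
# All hyperparameters and the fusion-by-summation are assumptions.
import torch
import torch.nn as nn

class FocalGlobalTemporalBlock(nn.Module):
    def __init__(self, dim, num_heads=4, kernel_size=3, dilation=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dilated temporal convolution captures short-range motion patterns
        # that plain global attention tends to underweight.
        pad = dilation * (kernel_size - 1) // 2
        self.local = nn.Conv1d(dim, dim, kernel_size,
                               padding=pad, dilation=dilation)

    def forward(self, x):
        # x: (batch * joints, frames, dim) -- one temporal sequence per joint
        h = self.norm(x)
        global_out, _ = self.attn(h, h, h)                         # long-range dependencies
        local_out = self.local(h.transpose(1, 2)).transpose(1, 2)  # local motion patterns
        return x + global_out + local_out                          # fuse global and local cues

# Usage: 2 clips x 25 joints, 64 frames, 128-dim joint embeddings
feats = torch.randn(2 * 25, 64, 128)
out = FocalGlobalTemporalBlock(128)(feats)
print(out.shape)  # torch.Size([50, 64, 128])
```

The point of the dilated branch is that plain global attention carries no built-in bias toward adjacent frames; the convolution supplies exactly that short-range motion prior, which the abstract reports is what makes the temporal transformer work.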
Related papers
- Cross Paradigm Representation and Alignment Transformer for Image Deraining [40.66823807648992]
We propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer).
Its core idea is hierarchical representation and alignment, leveraging the strengths of both paradigms to aid image reconstruction.
We use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA).
arXiv Detail & Related papers (2025-04-23T06:44:46Z) - Fourier Test-time Adaptation with Multi-level Consistency for Robust Classification [10.291631977766672]
We propose a novel approach called Fourier Test-time Adaptation (FTTA) to integrate input and model tuning.
FTTA builds a reliable multi-level consistency measurement of paired inputs to achieve self-supervision of the prediction.
It was extensively validated on three large classification datasets with different modalities and organs.
arXiv Detail & Related papers (2023-06-05T02:29:38Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - PhysFormer++: Facial Video-based Physiological Measurement with SlowFast
Temporal Difference Transformer [76.40106756572644]
Recent deep learning approaches focus on mining subtle clues using convolutional neural networks with limited-temporal receptive fields.
In this paper, we propose two end-to-end video transformer based on PhysFormer and Phys++++, to adaptively aggregate both local and global features for r representation enhancement.
Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra-temporal and cross-dataset testing.
arXiv Detail & Related papers (2023-02-07T15:56:03Z) - Global-local Motion Transformer for Unsupervised Skeleton-based Action
Learning [23.051184131833292]
We propose a new transformer model for the task of unsupervised learning of skeleton motion sequences.
The proposed model successfully learns local dynamics of the joints and captures global context from the motion sequences.
arXiv Detail & Related papers (2022-07-13T10:18:07Z) - Interaction Transformer for Human Reaction Generation [61.22481606720487]
We propose a novel interaction Transformer (InterFormer) consisting of a Transformer network with both temporal and spatial attentions.
Our method is general and can be used to generate more complex and long-term interactions.
arXiv Detail & Related papers (2022-07-04T19:30:41Z) - Disentangling Spatial-Temporal Functional Brain Networks via Twin-Transformers [12.137308815848717]
How to identify and characterize functional brain networks (BNs) is fundamental to gaining system-level insights into the mechanisms of brain organization architecture.
We propose a novel Twin-Transformers framework to simultaneously infer common and individual functional networks in both spatial and temporal space.
arXiv Detail & Related papers (2022-04-20T04:57:53Z) - SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors.
Inspired by recent attention mechanisms, we propose a multi-grain contextual focus module, termed MCF, to capture the action-associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z) - PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer [55.936527926778695]
Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields.
In this paper, we propose the PhysFormer, an end-to-end video transformer based architecture.
arXiv Detail & Related papers (2021-11-23T18:57:11Z) - An Attractor-Guided Neural Networks for Skeleton-Based Human Motion Prediction [0.4568777157687961]
Joint modeling is a crucial component in human motion prediction.
We learn a medium, called the balance attractor (BA), from temporal features to characterize the global motion features.
Through the BA, all joints are related synchronously, and thus the global coordination of all joints can be better learned.
arXiv Detail & Related papers (2021-05-20T12:51:39Z) - Spatial Temporal Transformer Network for Skeleton-based Action Recognition [12.117737635879037]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints.
In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations.
The two are combined in a two-stream network that outperforms state-of-the-art models using the same input data (a minimal sketch of this spatial/temporal factorisation follows the list).
arXiv Detail & Related papers (2020-12-11T14:58:21Z)
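For comparison with FG-STFormer's coupled design, the ST-TR entry above factorises attention into an intra-frame (SSA) and an inter-frame (TSA) stage. The sketch below chains the two stages for brevity, whereas ST-TR actually runs them in two separate streams; the shapes, head count, and embedding size are assumptions, not the paper's configuration.

```python
# Hedged sketch of SSA/TSA-style factorised attention: SSA treats the joints
# of each frame as tokens, TSA treats the frames of each joint as tokens.
import torch
import torch.nn as nn

class FactorisedSTAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.ssa = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # spatial
        self.tsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # temporal

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, v, c = x.shape
        s = x.reshape(b * t, v, c)                      # joints as tokens, per frame
        s, _ = self.ssa(s, s, s)                        # intra-frame interactions
        x = s.reshape(b, t, v, c)
        q = x.permute(0, 2, 1, 3).reshape(b * v, t, c)  # frames as tokens, per joint
        q, _ = self.tsa(q, q, q)                        # inter-frame correlations
        return q.reshape(b, v, t, c).permute(0, 2, 1, 3)

x = torch.randn(2, 64, 25, 128)             # 2 clips, 64 frames, 25 joints
print(FactorisedSTAttention(128)(x).shape)  # torch.Size([2, 64, 25, 128])
```

The contrast with FG-STFormer is that this factorisation attends over all joints and all frames uniformly, which is exactly the equal pair-wise treatment the FG-STFormer abstract argues undervalues discriminative focal joints and short-range dynamics.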