Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition
- URL: http://arxiv.org/abs/2210.02693v1
- Date: Thu, 6 Oct 2022 05:57:15 GMT
- Title: Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition
- Authors: Zhimin Gao, Peitao Wang, Pei Lv, Xiaoheng Jiang, Qidong Liu, Pichao Wang, Mingliang Xu and Wanqing Li
- Abstract summary: We propose a novel Focal and Global Spatial-Temporal Transformer network (FG-STFormer).
It is equipped with two key components: (1) FG-SFormer, a spatial transformer coupling focal joints with global body parts, and (2) FG-TFormer, a focal and global temporal transformer.
In FG-TFormer, dilated temporal convolution is integrated into the global self-attention mechanism to explicitly capture the local temporal motion patterns of joints and body parts.
- Score: 34.38874828210301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the great progress achieved by transformers in various vision
tasks, they remain underexplored for skeleton-based action recognition, with
only a few attempts to date. Moreover, these methods directly compute pair-wise
global self-attention equally for all joints in both the spatial and temporal
dimensions, undervaluing the effect of discriminative local joints and
short-range temporal dynamics. In this work, we propose a novel Focal and
Global Spatial-Temporal Transformer network (FG-STFormer), which is equipped
with two key components: (1) FG-SFormer, a spatial transformer coupling focal
joints with global body parts. It forces the network to focus on modelling
correlations for both the learned discriminative spatial joints and the human
body parts. Selecting focal joints eliminates the negative effect of
non-informative joints when accumulating correlations. Meanwhile, the
interactions between the focal joints and body parts are incorporated to
enhance the spatial dependencies via mutual cross-attention. (2) FG-TFormer, a
focal and global temporal transformer. Dilated temporal convolution is
integrated into the global self-attention mechanism to explicitly capture the
local temporal motion patterns of joints and body parts, which proves vital to
making the temporal transformer work. Extensive experimental results on three
benchmarks, namely NTU-60, NTU-120 and NW-UCLA, show that our FG-STFormer
surpasses all existing transformer-based methods and compares favourably with
state-of-the-art GCN-based methods.
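To make the FG-TFormer idea concrete, the sketch below pairs global temporal self-attention with a parallel dilated temporal convolution branch, as the abstract describes. It is a minimal illustration under assumptions, not the authors' implementation: the tensor layout, head count, kernel size, dilation rate, and additive fusion are all choices made here for brevity.

```python
# Hedged sketch of a "focal and global" temporal block: global self-attention
# over frames plus a dilated temporal convolution for local motion patterns.
# All hyperparameters and the fusion-by-summation are assumptions.
import torch
import torch.nn as nn

class FocalGlobalTemporalBlock(nn.Module):
    def __init__(self, dim, num_heads=4, kernel_size=3, dilation=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dilated temporal convolution captures short-range motion patterns
        # that plain global attention tends to underweight.
        pad = dilation * (kernel_size - 1) // 2
        self.local = nn.Conv1d(dim, dim, kernel_size,
                               padding=pad, dilation=dilation)

    def forward(self, x):
        # x: (batch * joints, frames, dim) -- one temporal sequence per joint
        h = self.norm(x)
        global_out, _ = self.attn(h, h, h)                         # long-range dependencies
        local_out = self.local(h.transpose(1, 2)).transpose(1, 2)  # local motion patterns
        return x + global_out + local_out                          # fuse global and local cues

# Usage: 2 clips x 25 joints, 64 frames, 128-dim joint embeddings
feats = torch.randn(2 * 25, 64, 128)
out = FocalGlobalTemporalBlock(128)(feats)
print(out.shape)  # torch.Size([50, 64, 128])
```

The point of the dilated branch is that plain global attention carries no built-in bias toward adjacent frames; the convolution supplies exactly that short-range motion prior, which the abstract reports is what makes the temporal transformer work.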
Related papers
- Cross Paradigm Representation and Alignment Transformer for Image Deraining [40.66823807648992]
We propose a novel Cross Paradigm Representation and Alignment Transformer (CPRAformer).
Its core idea is hierarchical representation and alignment, leveraging the strengths of both paradigms to aid image reconstruction.
We use two types of self-attention in the Transformer blocks: sparse prompt channel self-attention (SPC-SA) and spatial pixel refinement self-attention (SPR-SA).
arXiv Detail & Related papers (2025-04-23T06:44:46Z) - Fourier Test-time Adaptation with Multi-level Consistency for Robust Classification [10.291631977766672]
We propose a novel approach called Fourier Test-time Adaptation (FTTA) to integrate input and model tuning.
FTTA builds a reliable multi-level consistency measurement of paired inputs to achieve self-supervision of the prediction.
It was extensively validated on three large classification datasets with different modalities and organs.
arXiv Detail & Related papers (2023-06-05T02:29:38Z) - Global-to-Local Modeling for Video-based 3D Human Pose and Shape Estimation [53.04781510348416]
Video-based 3D human pose and shape estimation is evaluated by intra-frame accuracy and inter-frame smoothness.
We propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, the Global-to-Local Transformer (GLoT).
Our GLoT surpasses previous state-of-the-art methods with the fewest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M.
arXiv Detail & Related papers (2023-03-26T14:57:49Z) - PhysFormer++: Facial Video-based Physiological Measurement with SlowFast
Temporal Difference Transformer [76.40106756572644]
Recent deep learning approaches focus on mining subtle clues using convolutional neural networks with limited-temporal receptive fields.
In this paper, we propose two end-to-end video transformer based on PhysFormer and Phys++++, to adaptively aggregate both local and global features for r representation enhancement.
Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra-temporal and cross-dataset testing.
arXiv Detail & Related papers (2023-02-07T15:56:03Z) - Global-local Motion Transformer for Unsupervised Skeleton-based Action
Learning [23.051184131833292]
We propose a new transformer model for the task of unsupervised learning of skeleton motion sequences.
The proposed model successfully learns local dynamics of the joints and captures global context from the motion sequences.
arXiv Detail & Related papers (2022-07-13T10:18:07Z) - Interaction Transformer for Human Reaction Generation [61.22481606720487]
We propose a novel interaction Transformer (InterFormer) consisting of a Transformer network with both temporal and spatial attentions.
Our method is general and can be used to generate more complex and long-term interactions.
arXiv Detail & Related papers (2022-07-04T19:30:41Z) - Disentangling Spatial-Temporal Functional Brain Networks via Twin-Transformers [12.137308815848717]
How to identify and characterize functional brain networks (BNs) is fundamental to gaining system-level insights into the mechanisms of brain organization architecture.
We propose a novel Twin-Transformers framework to simultaneously infer common and individual functional networks in both spatial and temporal space.
arXiv Detail & Related papers (2022-04-20T04:57:53Z) - SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors.
Inspired by recent attention mechanisms, we propose a multi-grain contextual focus module, termed MCF, to capture the action-associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z) - PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer [55.936527926778695]
Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields.
In this paper, we propose the PhysFormer, an end-to-end video transformer based architecture.
arXiv Detail & Related papers (2021-11-23T18:57:11Z) - An Attractor-Guided Neural Networks for Skeleton-Based Human Motion Prediction [0.4568777157687961]
Joint modeling is a crucial component in human motion prediction.
We learn a medium, called the balance attractor (BA), from temporal features to characterize the global motion features.
Through the BA, all joints are related synchronously, and thus the global coordination of all joints can be better learned.
arXiv Detail & Related papers (2021-05-20T12:51:39Z) - Spatial Temporal Transformer Network for Skeleton-based Action Recognition [12.117737635879037]
We propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints.
In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) is used to model inter-frame correlations.
The two are combined in a two-stream network that outperforms state-of-the-art models using the same input data (a minimal sketch of this spatial/temporal factorisation follows the list).
arXiv Detail & Related papers (2020-12-11T14:58:21Z)
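For comparison with FG-STFormer's coupled design, the ST-TR entry above factorises attention into an intra-frame (SSA) and an inter-frame (TSA) stage. The sketch below chains the two stages for brevity, whereas ST-TR actually runs them in two separate streams; the shapes, head count, and embedding size are assumptions, not the paper's configuration.

```python
# Hedged sketch of SSA/TSA-style factorised attention: SSA treats the joints
# of each frame as tokens, TSA treats the frames of each joint as tokens.
import torch
import torch.nn as nn

class FactorisedSTAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.ssa = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # spatial
        self.tsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # temporal

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, v, c = x.shape
        s = x.reshape(b * t, v, c)                      # joints as tokens, per frame
        s, _ = self.ssa(s, s, s)                        # intra-frame interactions
        x = s.reshape(b, t, v, c)
        q = x.permute(0, 2, 1, 3).reshape(b * v, t, c)  # frames as tokens, per joint
        q, _ = self.tsa(q, q, q)                        # inter-frame correlations
        return q.reshape(b, v, t, c).permute(0, 2, 1, 3)

x = torch.randn(2, 64, 25, 128)             # 2 clips, 64 frames, 25 joints
print(FactorisedSTAttention(128)(x).shape)  # torch.Size([2, 64, 25, 128])
```

The contrast with FG-STFormer is that this factorisation attends over all joints and all frames uniformly, which is exactly the equal pair-wise treatment the FG-STFormer abstract argues undervalues discriminative focal joints and short-range dynamics.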