3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
Recognition
- URL: http://arxiv.org/abs/2303.14474v1
- Date: Sat, 25 Mar 2023 14:06:31 GMT
- Title: 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
Recognition
- Authors: Lei Wang and Piotr Koniusz
- Abstract summary: Many skeletal action recognition models use GCNs to represent the human body as 3D body joints connected into body parts.
We propose to form a hypergraph that models hyper-edges between graph nodes.
Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts.
- Score: 38.27785891922479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many skeletal action recognition models use GCNs to represent the human
body as 3D body joints connected into body parts. GCNs aggregate one- or few-hop
graph neighbourhoods and ignore dependencies between body joints that are not
directly linked. We propose to form a hypergraph to model hyper-edges between
graph nodes (e.g., third- and fourth-order hyper-edges capture three and four
nodes, respectively), which helps capture higher-order motion patterns of groups
of body joints. We split action sequences into temporal blocks, and a
Higher-order Transformer (HoT) produces embeddings of each temporal block based
on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order
hyper-edges of skeleton body joints. We combine such HoT embeddings of
hyper-edges of orders 1, ..., r by a
novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose
order can be exchanged to achieve coupled-mode attention on coupled-mode tokens
based on 'channel-temporal block', 'order-channel-body joint',
'channel-hyper-edge (any order)' and 'channel-only' pairs. The first module,
called Multi-order Pooling (MP), additionally learns weighted aggregation along
the hyper-edge mode, whereas the second module, Temporal block Pooling (TP),
aggregates along the temporal block mode. Our end-to-end trainable network
yields state-of-the-art results compared to GCN-, transformer- and
hypergraph-based counterparts.
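The abstract's two core ideas can be sketched in a few lines: enumerating hyper-edges of orders 1, ..., r over the joint set, and a weighted aggregation along the hyper-edge order mode in the spirit of the Multi-order Pooling (MP) module. The sketch below is a simplified illustration, not the authors' implementation; the function names, the use of a plain mean over edges, and the scalar per-order weights are all assumptions made for clarity.

```python
from itertools import combinations
import numpy as np

def hyper_edges(num_joints: int, r: int):
    """All hyper-edges of orders 1..r over `num_joints` body joints.
    An order-k hyper-edge is an unordered set of k distinct joints
    (order 1 = single joints, order 2 = pairwise links, etc.)."""
    return {k: list(combinations(range(num_joints), k)) for k in range(1, r + 1)}

def multi_order_pool(per_order_feats, order_weights):
    """Toy weighted aggregation along the hyper-edge order mode,
    loosely mirroring the role of Multi-order Pooling (MP).
    per_order_feats: {order k: (n_edges_k, d) array of hyper-edge embeddings}
    order_weights:   {order k: scalar weight for that order}
    Returns a single (d,) feature vector."""
    d = next(iter(per_order_feats.values())).shape[1]
    out = np.zeros(d)
    for k, feats in per_order_feats.items():
        # Pool all edges of order k, then weight the order's contribution.
        out += order_weights[k] * feats.mean(axis=0)
    return out

# Example: 4 joints, hyper-edges up to order 3.
edges = hyper_edges(4, 3)
feats = {k: np.random.rand(len(edges[k]), 8) for k in edges}
fused = multi_order_pool(feats, {1: 0.5, 2: 0.3, 3: 0.2})
```

In the paper the per-order weights are learned end-to-end and the attention operates over coupled-mode tokens; here fixed scalars stand in just to show the aggregation direction (along orders rather than along time, which is what Temporal block Pooling handles).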
Related papers
- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections [32.87473930173842]
We propose an adaptive hyper-graph convolutional network (Hyper-GCN) for action recognition. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted.
arXiv Detail & Related papers (2024-11-22T08:41:33Z) - Regular Splitting Graph Network for 3D Human Pose Estimation [5.177947445379688]
We introduce a higher-order regular splitting graph network (RS-Net) for 2D-to-3D human pose estimation.
Our model achieves superior performance over recent state-of-the-art methods for 3D human pose estimation.
arXiv Detail & Related papers (2023-05-09T22:13:04Z) - Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action
Recognition [38.27785891922479]
Few-shot learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2022-10-30T11:46:38Z) - DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action
Recognition [77.87404524458809]
We propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN).
It consists of two modules, DG-GCN and DG-TCN, respectively, for spatial and temporal modeling.
DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
arXiv Detail & Related papers (2022-10-12T03:17:37Z) - Multi-Scale Spatial Temporal Graph Convolutional Network for
Skeleton-Based Action Recognition [13.15374205970988]
We present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module.
The MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture.
We propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition.
arXiv Detail & Related papers (2022-06-27T03:17:33Z) - 3D Skeleton-based Few-shot Action Recognition with JEANIE is not so
Naïve [28.720272938306692]
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2021-12-23T16:09:23Z) - Multi-Scale Semantics-Guided Neural Networks for Efficient
Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network is proposed for skeleton-based action recognition.
MS-SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z) - NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One
Go [109.88509362837475]
We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes.
NeuroMorph produces smooth and point-to-point correspondences between them.
It works well for a large variety of input shapes, including non-isometric pairs from different object categories.
arXiv Detail & Related papers (2021-06-17T12:25:44Z) - Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which has long been speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.