3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
Recognition
- URL: http://arxiv.org/abs/2303.14474v1
- Date: Sat, 25 Mar 2023 14:06:31 GMT
- Title: 3Mformer: Multi-order Multi-mode Transformer for Skeletal Action
Recognition
- Authors: Lei Wang and Piotr Koniusz
- Abstract summary: Many skeletal action recognition models use GCNs to represent the human body as 3D body joints connected into body parts.
We propose to form a hypergraph that models hyper-edges between graph nodes.
Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts.
- Score: 38.27785891922479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many skeletal action recognition models use GCNs to represent the human
body as 3D body joints connected into body parts. GCNs aggregate one- or few-hop
graph neighbourhoods and ignore dependencies between body joints that are not
directly linked. We propose to form a hypergraph to model hyper-edges between
graph nodes (e.g., third- and fourth-order hyper-edges capture three and four
nodes, respectively), which helps capture higher-order motion patterns of groups
of body joints. We split action sequences into temporal blocks, and a
Higher-order Transformer (HoT) produces embeddings of each temporal block based
on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order
hyper-edges of skeleton body joints. We combine such HoT embeddings of
hyper-edges of orders 1, ..., r by a
novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose
order can be exchanged to achieve coupled-mode attention on coupled-mode tokens
based on 'channel-temporal block', 'order-channel-body joint',
'channel-hyper-edge (any order)' and 'channel-only' pairs. The first module,
called Multi-order Pooling (MP), additionally learns weighted aggregation along
the hyper-edge mode, whereas the second module, Temporal block Pooling (TP),
aggregates along the temporal block mode. Our end-to-end trainable network
yields state-of-the-art results compared to GCN-, transformer- and
hypergraph-based counterparts.
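The abstract's two core ideas can be sketched in a few lines: enumerating hyper-edges of orders 1, ..., r over the joint set, and a weighted aggregation along the hyper-edge order mode in the spirit of the Multi-order Pooling (MP) module. The sketch below is a simplified illustration, not the authors' implementation; the function names, the use of a plain mean over edges, and the scalar per-order weights are all assumptions made for clarity.

```python
from itertools import combinations
import numpy as np

def hyper_edges(num_joints: int, r: int):
    """All hyper-edges of orders 1..r over `num_joints` body joints.
    An order-k hyper-edge is an unordered set of k distinct joints
    (order 1 = single joints, order 2 = pairwise links, etc.)."""
    return {k: list(combinations(range(num_joints), k)) for k in range(1, r + 1)}

def multi_order_pool(per_order_feats, order_weights):
    """Toy weighted aggregation along the hyper-edge order mode,
    loosely mirroring the role of Multi-order Pooling (MP).
    per_order_feats: {order k: (n_edges_k, d) array of hyper-edge embeddings}
    order_weights:   {order k: scalar weight for that order}
    Returns a single (d,) feature vector."""
    d = next(iter(per_order_feats.values())).shape[1]
    out = np.zeros(d)
    for k, feats in per_order_feats.items():
        # Pool all edges of order k, then weight the order's contribution.
        out += order_weights[k] * feats.mean(axis=0)
    return out

# Example: 4 joints, hyper-edges up to order 3.
edges = hyper_edges(4, 3)
feats = {k: np.random.rand(len(edges[k]), 8) for k in edges}
fused = multi_order_pool(feats, {1: 0.5, 2: 0.3, 3: 0.2})
```

In the paper the per-order weights are learned end-to-end and the attention operates over coupled-mode tokens; here fixed scalars stand in just to show the aggregation direction (along orders rather than along time, which is what Temporal block Pooling handles).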
Related papers
- Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections [32.87473930173842]
We propose an adaptive hyper-graph convolutional network (Hyper-GCN) for action recognition. In particular, our Hyper-GCN adaptively optimises the hyper-graphs during training, revealing the action-driven multi-vertex relations. By injecting virtual connections into hyper-graphs, the semantic clues of diverse action categories can be highlighted.
arXiv Detail & Related papers (2024-11-22T08:41:33Z) - Regular Splitting Graph Network for 3D Human Pose Estimation [5.177947445379688]
We introduce a higher-order regular splitting graph network (RS-Net) for 2D-to-3D human pose estimation.
Our model achieves superior performance over recent state-of-the-art methods for 3D human pose estimation.
arXiv Detail & Related papers (2023-05-09T22:13:04Z) - Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action
Recognition [38.27785891922479]
Few-shot learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2022-10-30T11:46:38Z) - DG-STGCN: Dynamic Spatial-Temporal Modeling for Skeleton-based Action
Recognition [77.87404524458809]
We propose a new framework for skeleton-based action recognition, namely Dynamic Group Spatio-Temporal GCN (DG-STGCN).
It consists of two modules, DG-GCN and DG-TCN, respectively, for spatial and temporal modeling.
DG-STGCN consistently outperforms state-of-the-art methods, often by a notable margin.
arXiv Detail & Related papers (2022-10-12T03:17:37Z) - Multi-Scale Spatial Temporal Graph Convolutional Network for
Skeleton-Based Action Recognition [13.15374205970988]
We present a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module.
The MS-GC and MT-GC modules decompose the corresponding local graph convolution into a set of sub-graph convolutions, forming a hierarchical residual architecture.
We propose a multi-scale spatial temporal graph convolutional network (MST-GCN), which stacks multiple blocks to learn effective motion representations for action recognition.
arXiv Detail & Related papers (2022-06-27T03:17:33Z) - 3D Skeleton-based Few-shot Action Recognition with JEANIE is not so
Naïve [28.720272938306692]
We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt.
arXiv Detail & Related papers (2021-12-23T16:09:23Z) - Multi-Scale Semantics-Guided Neural Networks for Efficient
Skeleton-Based Human Action Recognition [140.18376685167857]
A simple yet effective multi-scale semantics-guided neural network is proposed for skeleton-based action recognition.
MS-SGN achieves the state-of-the-art performance on the NTU60, NTU120, and SYSU datasets.
arXiv Detail & Related papers (2021-11-07T03:50:50Z) - NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One
Go [109.88509362837475]
We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes.
NeuroMorph produces smooth and point-to-point correspondences between them.
It works well for a large variety of input shapes, including non-isometric pairs from different object categories.
arXiv Detail & Related papers (2021-06-17T12:25:44Z) - Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (EPN), which has long been speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action
Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.