What and Where: Modeling Skeletons from Semantic and Spatial
Perspectives for Action Recognition
- URL: http://arxiv.org/abs/2004.03259v2
- Date: Mon, 22 Mar 2021 12:31:40 GMT
- Title: What and Where: Modeling Skeletons from Semantic and Spatial
Perspectives for Action Recognition
- Authors: Lei Shi, Yifan Zhang, Jian Cheng and Hanqing Lu
- Abstract summary: We propose to model skeletons from a novel spatial perspective, from which the model takes the spatial location as prior knowledge to group human joints.
From the semantic perspective, we propose a Transformer-like network that is expert in modeling joint correlations.
From the spatial perspective, we transform the skeleton data into the sparse format for efficient feature extraction.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Skeleton data, which consists of only the 2D/3D coordinates of the human
joints, has been widely studied for human action recognition. Existing methods
take the semantics as prior knowledge to group human joints and draw
correlations according to their spatial locations, which we call the semantic
perspective for skeleton modeling. In this paper, in contrast to previous
approaches, we propose to model skeletons from a novel spatial perspective,
from which the model takes the spatial location as prior knowledge to group
human joints and mines the discriminative patterns of local areas in a
hierarchical manner. The two perspectives are orthogonal and complementary to
each other, and by fusing them in a unified framework, our method achieves a
more comprehensive understanding of the skeleton data. Moreover, we customize
one network for each perspective. From the semantic perspective, we
propose a Transformer-like network that excels at modeling joint
correlations, and present three effective techniques to adapt it to skeleton
data. From the spatial perspective, we transform the skeleton data into a
sparse format for efficient feature extraction and present two types of sparse
convolutional networks for sparse skeleton modeling. Extensive experiments are
conducted on three challenging datasets for skeleton-based human action/gesture
recognition, namely, NTU-60, NTU-120 and SHREC, where our method achieves
state-of-the-art performance.
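To make the semantic perspective concrete, below is a minimal sketch (an illustration under our own assumptions, not the authors' released code) of a Transformer-style block that models joint-to-joint correlations; the learned per-joint embedding stands in for the paper's adaptation techniques, and all names and hyperparameters here are hypothetical.

```python
# Minimal sketch: self-attention over skeleton joints (semantic perspective).
# Hypothetical names/hyperparameters; not the paper's released architecture.
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """One Transformer-style block over per-joint features.

    Input shape: (batch, num_joints, channels). A learned per-joint
    embedding marks joint identity ("what"), standing in for the paper's
    position-encoding adaptation.
    """
    def __init__(self, num_joints: int, channels: int, heads: int = 4):
        super().__init__()
        self.joint_embed = nn.Parameter(torch.zeros(1, num_joints, channels))
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.ReLU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.joint_embed           # inject joint identity
        a, _ = self.attn(x, x, x)          # joint-to-joint correlations
        x = self.norm1(x + a)              # residual + norm
        return self.norm2(x + self.ffn(x))

# Example: 25 NTU joints with 64-channel features.
block = JointAttentionBlock(num_joints=25, channels=64)
out = block(torch.randn(2, 25, 64))        # -> (2, 25, 64)
```

Each joint attends to every other joint regardless of skeletal distance, which is what lets a semantic-perspective model draw long-range correlations (e.g. between the two hands) in a single layer.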
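For the spatial perspective, the sketch below shows one plausible reading of "transforming skeleton data into a sparse format": quantize normalized joint coordinates onto a voxel grid and store the occupied cells as a sparse COO tensor, the kind of input a sparse convolutional network consumes. The grid size, normalization, and function name are assumptions, not details from the paper.

```python
# Minimal sketch: skeleton -> sparse voxel tensor (spatial perspective).
# Grid size and normalization are assumptions, not values from the paper.
import torch

def skeleton_to_sparse(joints: torch.Tensor, grid: int = 64) -> torch.Tensor:
    """joints: (num_joints, 3) coordinates, assumed normalized to [0, 1]."""
    idx = (joints.clamp(0, 1) * (grid - 1)).long()   # voxel index per joint
    vals = torch.ones(idx.shape[0])                  # occupancy values
    sparse = torch.sparse_coo_tensor(idx.t(), vals, size=(grid, grid, grid))
    return sparse.coalesce()                         # merge duplicate voxels

sparse = skeleton_to_sparse(torch.rand(25, 3))
print(sparse.values().numel())                       # at most 25 occupied voxels
```

Because at most one cell per joint is occupied, a sparse convolution only ever touches a few dozen voxels instead of the full 64^3 grid, which is where the efficiency of the sparse format comes from.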
Related papers
- GaitMA: Pose-guided Multi-modal Feature Fusion for Gait Recognition
Gait recognition is a biometric technology that identifies humans by their walking patterns.
We propose a novel gait recognition framework, dubbed Gait Multi-model Aggregation Network (GaitMA).
First, skeletons are represented by joint/limb-based heatmaps, and features from silhouettes and skeletons are respectively extracted using two CNN-based feature extractors.
arXiv Detail & Related papers (2024-07-20T09:05:17Z)
- Unsupervised 3D Pose Estimation with Non-Rigid Structure-from-Motion Modeling
We propose a new modeling method for human pose deformations and design an accompanying diffusion-based motion prior.
Inspired by the field of non-rigid structure-from-motion, we divide the task of reconstructing 3D human skeletons in motion into estimating a 3D reference skeleton and a per-frame skeleton deformation.
A mixed spatial-temporal NRSfMformer is used to simultaneously estimate the 3D reference skeleton and the per-frame skeleton deformations from a sequence of 2D observations.
arXiv Detail & Related papers (2023-08-18T16:41:57Z)
- Iterative Graph Filtering Network for 3D Human Pose Estimation
Graph convolutional networks (GCNs) have proven to be an effective approach for 3D human pose estimation.
In this paper, we introduce an iterative graph filtering framework for 3D human pose estimation.
Our approach builds upon the idea of iteratively solving graph filtering with Laplacian regularization; a generic sketch of this filtering step appears after the list below.
arXiv Detail & Related papers (2023-07-29T20:46:44Z)
- Learning 3D Human Pose Estimation from Dozens of Datasets using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats
We propose a novel affine-combining autoencoder (ACAE) method to perform dimensionality reduction on the number of landmarks.
Our approach scales to an extreme multi-dataset regime, where we use 28 3D human pose datasets to supervise one model.
arXiv Detail & Related papers (2022-12-29T22:22:49Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition
Action labels are available only for the source dataset and unavailable for the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between the two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks; a toy version of this permutation pretext task is sketched after the list below.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Skeleton-Contrastive 3D Action Representation Learning
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition.
Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets.
arXiv Detail & Related papers (2021-08-08T14:44:59Z)
- Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action Recognition
Graph Convolutional Networks (GCNs) have already demonstrated their powerful ability to model irregular data.
We present a novel spatial-temporal GCN architecture defined via Poincaré geometry.
We evaluate our method on the two largest-scale 3D skeleton datasets currently available.
arXiv Detail & Related papers (2020-07-30T18:23:18Z)
- Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action Recognition
We present a novel decoupled spatial-temporal attention network (DSTA-Net) for skeleton-based action recognition.
Three techniques are proposed for building attention blocks, namely, spatial-temporal attention decoupling, decoupled position encoding and spatial global regularization.
To test the effectiveness of the proposed method, extensive experiments are conducted on four challenging datasets for skeleton-based gesture and action recognition.
arXiv Detail & Related papers (2020-07-07T07:58:56Z)
- Learning 3D Human Shape and Pose from Dense Body Parts
We propose a Decompose-and-aggregate Network (DaNet) to learn 3D human shape and pose from dense correspondences of body parts.
Messages from local streams are aggregated to enhance the robustness of the rotation-based pose predictions.
Our method is validated on both indoor and real-world datasets including Human3.6M, UP3D, COCO, and 3DPW.
arXiv Detail & Related papers (2019-12-31T15:09:51Z)
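As referenced in the Iterative Graph Filtering entry above, here is a generic sketch of Laplacian-regularized graph filtering (our reconstruction of the textbook technique, not that paper's network): smooth per-joint features X by iteratively minimizing ||Y - X||^2 + lam * tr(Y^T L Y), whose closed-form solution is Y = (I + lam * L)^(-1) X.

```python
# Generic sketch: iterative Laplacian-regularized graph filtering.
# Our reconstruction of the textbook technique, not that paper's network.
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    return np.diag(adj.sum(axis=1)) - adj

def iterative_graph_filter(x, adj, lam=0.5, steps=100, lr=0.1):
    """Gradient iterations on ||y - x||^2 + lam * tr(y^T L y)."""
    L = laplacian(adj)
    y = x.copy()
    for _ in range(steps):
        grad = (y - x) + lam * (L @ y)   # proportional to the gradient in y
        y = y - lr * grad
    return y

# Toy 3-joint chain graph with 2-D features per joint.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.array([[0.0, 1.0], [5.0, 5.0], [1.0, 0.0]])
print(iterative_graph_filter(x, adj))    # features smoothed toward neighbors
```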
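And as referenced in the Temporal Spatial Cubism entry, a toy version of a permutation pretext task: shuffle the temporal segments of a skeleton sequence and treat the applied permutation as a classification label. The segment count and tensor shapes are assumptions for illustration, not that paper's settings.

```python
# Toy sketch: permutation pretext task over temporal segments.
# Our illustration of the generic idea, not the Cubism paper's code.
import itertools
import numpy as np

PERMS = list(itertools.permutations(range(3)))  # 6 classes for 3 segments

def permute_segments(seq: np.ndarray, perm_id: int) -> np.ndarray:
    """seq: (frames, joints, 3); frame count assumed divisible by 3."""
    segments = np.split(seq, 3, axis=0)
    return np.concatenate([segments[i] for i in PERMS[perm_id]], axis=0)

def make_pretext_sample(seq: np.ndarray, rng: np.random.Generator):
    label = rng.integers(len(PERMS))            # which shuffle was applied
    return permute_segments(seq, label), label  # (input, pretext target)

rng = np.random.default_rng(0)
seq = rng.standard_normal((30, 25, 3))          # 30 frames, 25 joints
x, y = make_pretext_sample(seq, rng)
print(x.shape, y)                               # (30, 25, 3) and a class in [0, 6)
```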
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.