Group-Skeleton-Based Human Action Recognition in Complex Events
- URL: http://arxiv.org/abs/2011.13273v2
- Date: Thu, 25 Feb 2021 03:42:32 GMT
- Title: Group-Skeleton-Based Human Action Recognition in Complex Events
- Authors: Tingtian Li, Zixun Sun, Xiao Chen
- Abstract summary: We propose a novel group-skeleton-based human action recognition method in complex events.
This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons.
Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
- Score: 15.649778891665468
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition, an important application of computer vision, has
been studied for decades. Among various approaches, skeleton-based methods
have recently attracted increasing attention due to their robust and superior
performance. However, existing skeleton-based methods ignore the potential
action relationships between different persons, even though the action of a person is
highly likely to be influenced by another person, especially in complex events. In
this paper, we propose a novel group-skeleton-based human action recognition
method in complex events. This method first utilizes multi-scale
spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton
features from multiple persons. In addition to the traditional key point
coordinates, we also input the key point speed values to the networks for
better performance. Then we use multilayer perceptrons (MLPs) to embed the
distance values between the reference person and other persons into the
extracted features. Lastly, all the features are fed into another MS-G3D for
feature fusion and classification. To mitigate class imbalance, the
networks are trained with a focal loss. The proposed algorithm is also our
solution for the Large-scale Human-centric Video Analysis in Complex Events
Challenge. Results on the HiEve dataset show that our method can give superior
performance compared to other state-of-the-art methods.
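Two ingredients named in the abstract, the key point speed input and the focal loss, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names (`keypoint_speeds`, `softmax`, `focal_loss`) are hypothetical, speed is assumed to be the frame-to-frame coordinate difference with zeros for the first frame, and `gamma=2.0` is an assumed default, since the abstract does not state the value used.

```python
import math


def keypoint_speeds(frames):
    """Frame-to-frame displacement of every key point.

    frames: list of frames, each a list of (x, y) key point coordinates.
    Assumption: the first frame has no predecessor, so its speed is zero.
    """
    speeds = [[(0.0, 0.0) for _ in frames[0]]]
    for prev, curr in zip(frames, frames[1:]):
        speeds.append(
            [(cx - px, cy - py) for (px, py), (cx, cy) in zip(prev, curr)]
        )
    return speeds


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    p_t is the softmax probability of the true class. With gamma = 0 this
    reduces to ordinary cross-entropy; larger gamma down-weights samples
    the model already classifies confidently.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Two frames, two key points: only the first key point moves.
frames = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 2.0), (1.0, 1.0)]]
speeds = keypoint_speeds(frames)

# An easy sample (confident, correct) versus a hard one (confident, wrong).
easy = focal_loss([5.0, 0.0], target=0)
hard = focal_loss([5.0, 0.0], target=1)
```

Because the modulating factor shrinks the loss of well-classified samples far more than that of misclassified ones, frequent and easy classes contribute less to the gradient, which is the usual motivation for using a focal loss on imbalanced data.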
Related papers
- Improving Video Violence Recognition with Human Interaction Learning on
3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- View-Invariant Skeleton-based Action Recognition via Global-Local Contrastive Learning [15.271862140292837]
We propose a new view-invariant representation learning approach, without any manual action labeling, for skeleton-based human action recognition.
We leverage the multi-view skeleton data simultaneously taken for the same person in the network training, by maximizing the mutual information between the representations extracted from different views.
We show that the proposed method is robust to the view difference of the input skeleton data and significantly boosts the performance of unsupervised skeleton-based human action methods.
arXiv Detail & Related papers (2022-09-23T15:00:57Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized spatio-temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at at least two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z)
- Revisiting Skeleton-based Action Recognition [107.08112310075114]
PoseC3D is a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons.
On four challenging datasets, PoseC3D consistently obtains superior performance, when used alone on skeletons and in combination with the RGB modality.
arXiv Detail & Related papers (2021-04-28T06:32:17Z)
- Differentiable Multi-Granularity Human Representation Learning for Instance-Aware Human Semantic Parsing [131.97475877877608]
A new bottom-up regime is proposed to learn category-level human semantic segmentation and multi-person pose estimation in a joint and end-to-end manner.
It is a compact, efficient and powerful framework that exploits structural information over different human granularities.
Experiments on three instance-aware human datasets show that our model outperforms other bottom-up alternatives with much more efficient inference.
arXiv Detail & Related papers (2021-03-08T06:55:00Z)
- Gesture Recognition from Skeleton Data for Intuitive Human-Machine Interaction [0.6875312133832077]
We propose an approach for segmentation and classification of dynamic gestures based on a set of handcrafted features.
The method for gesture recognition applies a sliding window, which extracts information from both the spatial and temporal dimensions.
At the end, the recognized gestures are used to interact with a collaborative robot.
arXiv Detail & Related papers (2020-08-26T11:28:50Z)
- Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning about human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer (CCM), which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.