Unified Keypoint-based Action Recognition Framework via Structured
Keypoint Pooling
- URL: http://arxiv.org/abs/2303.15270v1
- Date: Mon, 27 Mar 2023 14:59:08 GMT
- Title: Unified Keypoint-based Action Recognition Framework via Structured
Keypoint Pooling
- Authors: Ryo Hachiuma, Fumiaki Sato, Taiki Sekii
- Abstract summary: This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition.
A point cloud deep-learning paradigm is introduced to action recognition.
A novel deep neural network architecture called Structured Keypoint Pooling is proposed.
- Score: 3.255030588361124
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper simultaneously addresses three limitations of
conventional skeleton-based action recognition: skeleton detection and tracking
errors, the poor variety of the targeted actions, and the lack of person-wise
and frame-wise action recognition. A point cloud deep-learning paradigm is
introduced to action recognition, and a unified framework along with a
novel deep neural network architecture called Structured Keypoint Pooling is
proposed. The proposed method sparsely aggregates keypoint features in a
cascaded manner based on prior knowledge of the data structure (which is
inherent in skeletons), such as the instances and frames to which each keypoint
belongs, and achieves robustness against input errors. Its less constrained and
tracking-free architecture enables time-series keypoints consisting of human
skeletons and nonhuman object contours to be efficiently treated as an input 3D
point cloud and extends the variety of the targeted actions. Furthermore, we
propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This
trick switches the pooling kernels between the training and inference phases to
detect person-wise and frame-wise actions in a weakly supervised manner using
only video-level action labels. This trick enables our training scheme to
naturally introduce novel data augmentation, which mixes multiple point clouds
extracted from different videos. In the experiments, we comprehensively verify
the effectiveness of the proposed method against the limitations, and the
method outperforms state-of-the-art skeleton-based action recognition and
spatio-temporal action localization methods.
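The cascaded, structure-aware pooling and the Pooling-Switching Trick described in the abstract can be illustrated with a toy sketch. This is a hypothetical reconstruction from the abstract alone, not the authors' implementation: `segment_max_pool` and all feature shapes, frame ids, and instance ids are made up for illustration.

```python
import numpy as np

def segment_max_pool(features, segment_ids):
    """Max-pool feature rows that share a segment id (e.g. a frame or an instance)."""
    ids = np.unique(segment_ids)
    pooled = np.stack([features[segment_ids == i].max(axis=0) for i in ids])
    return ids, pooled

# Hypothetical toy input: 6 keypoints with 4-dim features, drawn from
# 2 frames and 2 person instances of one video (all values made up).
feats = np.random.rand(6, 4)
frame_ids = np.array([0, 0, 0, 1, 1, 1])
inst_ids = np.array([0, 1, 0, 0, 1, 1])

# Cascaded pooling guided by the data structure:
# keypoints -> per-frame features -> one video-level feature.
_, per_frame = segment_max_pool(feats, frame_ids)
video_feat = per_frame.max(axis=0)

# Pooling-switching idea: at inference, pool per (instance, frame)
# pair instead of over the whole video, yielding localized features
# from a network trained with only video-level labels.
pair_ids = inst_ids * 2 + frame_ids
_, per_pair = segment_max_pool(feats, pair_ids)
```

One plausible reading of why the kernel switch is possible: max pooling composes, so pooling per frame and then over frames equals pooling over the whole video, letting the same learned features be pooled at video level during training and at instance/frame level during inference.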
Related papers
- Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation [14.033701085783177]
We propose Expressive Keypoints that incorporates hand and foot details to form a fine-grained skeletal representation, improving the discriminative ability for existing models in discerning intricate actions.
A plug-and-play Instance Pooling module is exploited to extend our approach to multi-person scenarios without surging computation costs.
arXiv Detail & Related papers (2024-06-26T01:48:56Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences [29.376328807860993]
We propose a Partial Spatio-Temporal Learning (PSTL) framework to exploit the local relationship between different skeleton joints and video frames.
Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD under various downstream tasks.
arXiv Detail & Related papers (2023-02-17T17:35:05Z)
- From Keypoints to Object Landmarks via Self-Training Correspondence: A novel approach to Unsupervised Landmark Discovery [37.78933209094847]
This paper proposes a novel paradigm for the unsupervised learning of object landmark detectors.
We validate our method on a variety of difficult datasets, including LS3D, BBCPose, Human3.6M and PennAction.
arXiv Detail & Related papers (2022-05-31T15:44:29Z)
- Revisiting spatio-temporal layouts for compositional action recognition [63.04778884595353]
We take an object-centric approach to action recognition.
The main focus of this paper is compositional/few-shot action recognition.
We demonstrate how to improve the performance of appearance-based models by fusion with layout-based models.
arXiv Detail & Related papers (2021-11-02T23:04:39Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Real-time Human Action Recognition Using Locally Aggregated Kinematic-Guided Skeletonlet and Supervised Hashing-by-Analysis Model [30.435850177921086]
3D action recognition suffers from three problems: highly complicated articulation, a great amount of noise, and a low implementation efficiency.
We propose a real-time 3D action recognition framework by integrating the locally aggregated kinematic-guided skeletonlet (LAKS) with a supervised hashing-by-analysis (SHA) model.
Experimental results on MSRAction3D, UTKinectAction3D and Florence3DAction datasets demonstrate that the proposed method outperforms state-of-the-art methods in both recognition accuracy and implementation efficiency.
arXiv Detail & Related papers (2021-05-24T14:46:40Z)
- Panoster: End-to-end Panoptic Segmentation of LiDAR Point Clouds [81.12016263972298]
We present Panoster, a novel proposal-free panoptic segmentation method for LiDAR point clouds.
Unlike previous approaches, Panoster proposes a simplified framework incorporating a learning-based clustering solution to identify instances.
At inference time, this acts as a class-agnostic segmentation, allowing Panoster to be fast, while outperforming prior methods in terms of accuracy.
arXiv Detail & Related papers (2020-10-28T18:10:20Z)
- Skeleton-Aware Networks for Deep Motion Retargeting [83.65593033474384]
We introduce a novel deep learning framework for data-driven motion retargeting between skeletons.
Our approach learns how to retarget without requiring any explicit pairing between the motions in the training set.
arXiv Detail & Related papers (2020-05-12T12:51:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.