Explore Human Parsing Modality for Action Recognition
- URL: http://arxiv.org/abs/2401.02138v1
- Date: Thu, 4 Jan 2024 08:43:41 GMT
- Title: Explore Human Parsing Modality for Action Recognition
- Authors: Jinfu Liu, Runwei Ding, Yuhang Wen, Nan Dai, Fanyang Meng, Shen Zhao,
Mengyuan Liu
- Abstract summary: We propose a new dual-branch framework called Ensemble Human Parsing and Pose Network (EPP-Net)
EPP-Net is the first to leverage both skeletons and human parsing modalities for action recognition.
- Score: 17.624946657761996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal action recognition methods have achieved high success using
pose and RGB modalities. However, skeleton sequences lack appearance depiction,
and RGB images suffer from irrelevant noise due to modality limitations. To address
this, we introduce the human parsing feature map as a novel modality, since it can
selectively retain effective semantic features of the body parts while
filtering out most irrelevant noise. We propose a new dual-branch framework
called Ensemble Human Parsing and Pose Network (EPP-Net), which is the first to
leverage both skeletons and human parsing modalities for action recognition.
The first branch, for human pose, feeds robust skeletons into a graph convolutional
network to model pose features, while the second branch, for human parsing,
leverages depictive parsing feature maps to model parsing features via
convolutional backbones. The two high-level features are then effectively
combined through a late fusion strategy for better action recognition.
Extensive experiments on NTU RGB+D and NTU RGB+D 120 benchmarks consistently
verify the effectiveness of our proposed EPP-Net, which outperforms the
existing action recognition methods. Our code is available at:
https://github.com/liujf69/EPP-Net-Action.
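The abstract describes a dual-branch design whose outputs are combined by late fusion. A common instance of late fusion is a weighted sum of per-class softmax scores from each branch; the sketch below illustrates that idea in plain numpy. The weight `alpha` and the toy logits are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over class scores.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def late_fuse(pose_logits, parsing_logits, alpha=0.6):
    """Score-level late fusion of the two branches.

    alpha weights the pose branch, (1 - alpha) the parsing branch;
    the result is a convex combination of per-class probabilities.
    """
    return alpha * softmax(pose_logits) + (1.0 - alpha) * softmax(parsing_logits)

# Toy example with 3 action classes (shape: batch x classes).
pose = np.array([[2.0, 0.5, 0.1]])     # pose branch favours class 0
parsing = np.array([[0.2, 1.8, 0.1]])  # parsing branch favours class 1
fused = late_fuse(pose, parsing, alpha=0.6)
pred = int(fused.argmax(axis=-1)[0])
```

Because the fused output is a convex combination of two probability distributions, each row still sums to 1, so it can be read directly as class probabilities.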
Related papers
- Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph [4.075741925017479]
Group Activity Recognition aims to understand collective activities from videos.
Existing solutions rely on the RGB modality, which encounters challenges such as background variations.
We design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity.
arXiv Detail & Related papers (2024-07-28T13:57:03Z) - Integrating Human Parsing and Pose Network for Human Action Recognition [12.308394270240463]
We introduce human parsing feature map as a novel modality for action recognition.
We propose Integrating Human Parsing and Pose Network (IPP-Net) for action recognition.
IPP-Net is the first to leverage both skeletons and human parsing feature maps in a dual-branch approach.
arXiv Detail & Related papers (2023-07-16T07:58:29Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two inter-leaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Leveraging Third-Order Features in Skeleton-Based Action Recognition [26.349722372701482]
Skeleton sequences are light-weight and compact, and thus ideal candidates for action recognition on edge devices.
Recent action recognition methods extract features from 3D joint coordinates as spatial-temporal cues, using these representations in a graph neural network for feature fusion.
We propose fusing third-order features in the form of angles into modern architectures, to robustly capture the relationships between joints and body parts.
arXiv Detail & Related papers (2021-05-04T15:23:29Z) - Revisiting Skeleton-based Action Recognition [107.08112310075114]
PoseC3D is a new approach to skeleton-based action recognition, which relies on a 3D heatmap stack instead of a graph sequence as the base representation of human skeletons.
On four challenging datasets, PoseC3D consistently obtains superior performance, both when used alone on skeletons and in combination with the RGB modality.
arXiv Detail & Related papers (2021-04-28T06:32:17Z) - Glance and Gaze: Inferring Action-aware Points for One-Stage
Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely Glance and Gaze Network (GGNet)
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
arXiv Detail & Related papers (2021-04-12T08:01:04Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - Skeleton Focused Human Activity Recognition in RGB Video [11.521107108725188]
We propose a multimodal feature fusion model that utilizes both skeleton and RGB modalities to infer human activity.
The model could be either individually or uniformly trained by the back-propagation algorithm in an end-to-end manner.
arXiv Detail & Related papers (2020-04-29T06:40:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.