Technical Report: Disentangled Action Parsing Networks for Accurate
Part-level Action Parsing
- URL: http://arxiv.org/abs/2111.03225v1
- Date: Fri, 5 Nov 2021 02:29:32 GMT
- Title: Technical Report: Disentangled Action Parsing Networks for Accurate
Part-level Action Parsing
- Authors: Xuanhan Wang and Xiaojia Chen and Lianli Gao and Lechao Chen and
Jingkuan Song
- Abstract summary: Part-level Action Parsing aims at part state parsing for boosting action recognition in videos.
We present a simple yet effective approach, named disentangled action parsing (DAP).
- Score: 65.87931036949458
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Part-level Action Parsing aims at part state parsing for boosting action
recognition in videos. Despite dramatic progress in video classification
research, a severe problem faced by the community is that the detailed
understanding of human actions is ignored. Our motivation is that parsing human
actions requires models that focus on the specific problem. We present a simple
yet effective approach, named disentangled action parsing (DAP). Specifically,
we divide part-level action parsing into three stages: 1) person detection,
where a person detector is adopted to detect all persons in a video and to
perform instance-level action recognition; 2) part parsing, where a
part-parsing model is proposed to recognize human parts from detected person
images; and 3) action parsing, where a multi-modal action parsing network is
used to parse the action category conditioned on all detection results obtained
from the previous stages. With these three major models applied, our DAP
approach records a global mean score of $0.605$ in the 2021 Kinetics-TPS
Challenge.
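A minimal sketch of this three-stage pipeline is given below. All class and function names (`person_detector`, `part_parser`, `action_parser`, `PersonDetection`) are illustrative assumptions, not the authors' released code; frames are assumed to be H x W x C arrays with integer box coordinates.

```python
# Minimal sketch of the three-stage DAP pipeline described in the abstract.
# Every name here is an assumption for illustration, not the paper's code.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PersonDetection:
    box: tuple                 # (x1, y1, x2, y2) integer frame coordinates
    action_logits: list        # instance-level action scores (stage 1)
    part_states: dict = field(default_factory=dict)  # filled in stage 2

def parse_video(frames, person_detector, part_parser, action_parser):
    """Stage 1: detect persons; stage 2: parse their parts;
    stage 3: parse actions conditioned on all detections."""
    detections = []
    for frame in frames:
        persons: List[PersonDetection] = person_detector(frame)
        for p in persons:
            x1, y1, x2, y2 = p.box
            crop = frame[y1:y2, x1:x2]          # detected person image
            p.part_states = part_parser(crop)   # stage 2: part parsing
        detections.append(persons)
    # Stage 3: multi-modal action parsing over all per-frame detections.
    return action_parser(detections)
```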
Related papers
- Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework.
It simultaneously performs Detection And Interaction Reasoning in one stage.
We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
- Progression-Guided Temporal Action Detection in Videos [20.02711550239915]
We present a novel framework, Action Progression Network (APN), for temporal action detection (TAD) in videos.
The framework locates actions in videos by detecting the action evolution process.
We quantify a complete action process into 101 ordered stages and train a neural network to recognize the action progressions.
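As a hedged illustration of that 101-stage quantization, one plausible labeling scheme maps a frame's position inside an action instance to a progression stage; the function below is an assumption about how such labels could be derived, not APN's actual code.

```python
# Hypothetical progression labeling in the spirit of APN: a frame inside an
# action instance gets one of 101 ordered stage labels (0% ... 100%).
def progression_label(frame_idx, action_start, action_end, num_stages=101):
    """Map a frame to its action-progression stage in [0, num_stages - 1]."""
    span = max(action_end - action_start, 1)
    ratio = (frame_idx - action_start) / span      # 0.0 at onset, 1.0 at end
    return round(min(max(ratio, 0.0), 1.0) * (num_stages - 1))

assert progression_label(10, 10, 110) == 0     # action onset
assert progression_label(60, 10, 110) == 50    # halfway through
assert progression_label(110, 10, 110) == 100  # completion
```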
arXiv Detail & Related papers (2023-08-18T03:14:05Z)
- Integrating Human Parsing and Pose Network for Human Action Recognition [12.308394270240463]
We introduce the human parsing feature map as a novel modality for action recognition and propose the Integrating Human Parsing and Pose Network (IPP-Net).
IPP-Net is the first to leverage both skeletons and human parsing feature maps in a dual-branch approach.
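The dual-branch idea can be sketched as late fusion of the two branches' class scores. The PyTorch module below is a minimal sketch under that assumption; the branch architectures and the fusion weight `alpha` are placeholders, not IPP-Net's actual design.

```python
# Illustrative dual-branch late fusion of a skeleton branch and a
# human-parsing branch; module names and fusion rule are assumptions.
import torch.nn as nn

class DualBranchFusion(nn.Module):
    def __init__(self, skeleton_net: nn.Module, parsing_net: nn.Module,
                 alpha: float = 0.5):
        super().__init__()
        self.skeleton_net = skeleton_net  # e.g. a GCN over joint coordinates
        self.parsing_net = parsing_net    # e.g. a CNN over parsing feature maps
        self.alpha = alpha                # fusion weight between the branches

    def forward(self, skeleton, parsing_maps):
        s_logits = self.skeleton_net(skeleton)
        p_logits = self.parsing_net(parsing_maps)
        # Weighted late fusion of the two branches' class scores.
        return self.alpha * s_logits + (1 - self.alpha) * p_logits
```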
arXiv Detail & Related papers (2023-07-16T07:58:29Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- AIParsing: Anchor-free Instance-level Human Parsing [98.80740676794254]
We design an instance-level human parsing network that is anchor-free and solvable at the pixel level.
It consists of two simple sub-networks: an anchor-free detection head for bounding box predictions and an edge-guided parsing head for human segmentation.
Our method achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.
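A schematic of such a two-head layout is sketched below, assuming FCOS-style per-pixel box regression for the anchor-free detection head; the layer shapes and the omitted edge guidance are simplifications, not the paper's exact heads.

```python
# Schematic two-head layout in the spirit of AIParsing; sizes are assumptions.
import torch
import torch.nn as nn

class TwoHeadParser(nn.Module):
    def __init__(self, in_channels: int = 256, num_part_classes: int = 20):
        super().__init__()
        # Anchor-free detection head: per-pixel box offsets (l, t, r, b)
        # plus a centerness score, as in FCOS-style detectors.
        self.det_head = nn.Conv2d(in_channels, 4 + 1, kernel_size=3, padding=1)
        # Parsing head: per-pixel part-class logits for human segmentation.
        self.parse_head = nn.Conv2d(in_channels, num_part_classes,
                                    kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        det = self.det_head(feats)      # (B, 5, H, W): boxes + centerness
        parts = self.parse_head(feats)  # (B, num_part_classes, H, W)
        return det, parts
```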
arXiv Detail & Related papers (2022-07-14T12:19:32Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- End-to-end One-shot Human Parsing [91.5113227694443]
The one-shot human parsing (OSHP) task requires parsing humans into an open set of classes defined by any test example.
An End-to-end One-shot human Parsing Network (EOP-Net) is proposed.
EOP-Net outperforms representative one-shot segmentation models by large margins.
arXiv Detail & Related papers (2021-05-04T01:35:50Z)
- Glance and Gaze: Inferring Action-aware Points for One-Stage Human-Object Interaction Detection [81.32280287658486]
We propose a novel one-stage method, namely the Glance and Gaze Network (GGNet).
GGNet adaptively models a set of action-aware points (ActPoints) via glance and gaze steps.
We design an action-aware approach that effectively matches each detected interaction with its associated human-object pair.
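One toy way to read that matching step is nearest-midpoint assignment of each interaction point to a human-object pair; the heuristic below is purely illustrative and not GGNet's actual matching procedure.

```python
# Toy matching suggested by the summary: assign each interaction point to the
# human-object pair whose box-center midpoint lies closest to it.
def match_interactions(act_points, human_boxes, object_boxes):
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    matches = []
    for px, py in act_points:
        best, best_d = None, float("inf")
        for hi, hb in enumerate(human_boxes):
            for oi, ob in enumerate(object_boxes):
                (hx, hy), (ox, oy) = center(hb), center(ob)
                mx, my = (hx + ox) / 2, (hy + oy) / 2  # pair midpoint
                d = (px - mx) ** 2 + (py - my) ** 2
                if d < best_d:
                    best, best_d = (hi, oi), d
        matches.append(best)  # (human index, object index) or None
    return matches
```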
arXiv Detail & Related papers (2021-04-12T08:01:04Z)