Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework
- URL: http://arxiv.org/abs/2203.04476v1
- Date: Wed, 9 Mar 2022 01:30:57 GMT
- Title: Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework
- Authors: Xiaodong Chen, Xinchen Liu, Wu Liu, Kun Liu, Dong Wu, Yongdong Zhang,
Tao Mei
- Abstract summary: Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
- Score: 108.70949305791201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Action recognition from videos, i.e., classifying a video into one of the
pre-defined action types, has been a popular topic in the communities of
artificial intelligence, multimedia, and signal processing. However, existing
methods usually consider an input video as a whole and learn models, e.g.,
Convolutional Neural Networks (CNNs), with coarse video-level class labels.
These methods can only output an action class for the video, but cannot provide
fine-grained and explainable cues to answer why the video shows a specific
action. Therefore, researchers have started to focus on a new task, Part-level Action
Parsing (PAP), which aims to not only predict the video-level action but also
recognize the frame-level fine-grained actions or interactions of body parts
for each person in the video. To this end, we propose a coarse-to-fine
framework for this challenging task. In particular, our framework first
predicts the video-level class of the input video, then localizes the body
parts and predicts the part-level action. Moreover, to balance accuracy and
computation in part-level action parsing, we propose to recognize the
part-level actions from segment-level features. Furthermore, to overcome the
ambiguity of body parts, we propose a pose-guided positional embedding method
to accurately localize body parts. Through comprehensive experiments on a
large-scale dataset, i.e., Kinetics-TPS, our framework achieves
state-of-the-art performance, outperforming existing methods with a 31.10% ROC
score.
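To make the coarse-to-fine pipeline concrete, below is a minimal PyTorch sketch of the two stages and the pose-guided positional embedding. Everything here is an illustrative assumption rather than the authors' released implementation: the module names, the feature dimension, and the class counts (10 body parts, 74 part-level actions, 24 video-level classes) are placeholders for Kinetics-TPS-like data, and the mean over segments stands in for the paper's segment-level feature aggregation.

```python
# Hypothetical sketch of a pose-guided coarse-to-fine parser; shapes,
# names, and class counts are assumptions, not the paper's actual code.
import torch
import torch.nn as nn

NUM_PARTS, NUM_PART_ACTIONS, NUM_VIDEO_CLASSES, FEAT_DIM = 10, 74, 24, 256

class CoarseToFineParser(nn.Module):
    def __init__(self):
        super().__init__()
        # Coarse stage: classify the whole clip into a video-level action.
        self.video_head = nn.Linear(FEAT_DIM, NUM_VIDEO_CLASSES)
        # Pose-guided positional embedding: project normalized 2D keypoint
        # coordinates into the feature space so that visually similar parts
        # (e.g. left vs. right hand) become distinguishable.
        self.pos_embed = nn.Linear(2, FEAT_DIM)
        # Fine stage: classify each localized body part into a part action.
        self.part_head = nn.Linear(FEAT_DIM, NUM_PART_ACTIONS)

    def forward(self, clip_feat, part_feats, keypoints):
        # clip_feat:  (B, FEAT_DIM)                pooled video feature
        # part_feats: (B, S, NUM_PARTS, FEAT_DIM)  segment-level part features
        # keypoints:  (B, S, NUM_PARTS, 2)         normalized pose coordinates
        video_logits = self.video_head(clip_feat)
        fused = part_feats + self.pos_embed(keypoints)
        # Averaging over the S segments trades frame-level detail for
        # computation, mirroring the segment-level recognition idea.
        part_logits = self.part_head(fused.mean(dim=1))
        return video_logits, part_logits

# Toy usage with random tensors (batch of 2 videos, 4 segments each):
model = CoarseToFineParser()
v, p = model(torch.randn(2, FEAT_DIM),
             torch.randn(2, 4, NUM_PARTS, FEAT_DIM),
             torch.rand(2, 4, NUM_PARTS, 2))
print(v.shape, p.shape)  # torch.Size([2, 24]) torch.Size([2, 10, 74])
```

In a full system the part features would come from RoI pooling around detected persons and the keypoints from an off-the-shelf pose estimator; the additive fusion above is one simple way a pose-guided positional embedding could be realized.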
Related papers
- Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition [84.31749632725929]
In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method.
Our key idea is to push video representations apart from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains.
arXiv Detail & Related papers (2024-03-03T16:48:16Z)
- Technical Report: Disentangled Action Parsing Networks for Accurate Part-level Action Parsing [65.87931036949458]
Part-level Action Parsing aims at part state parsing for boosting action recognition in videos.
We present a simple yet effective approach named disentangled action parsing (DAP).
arXiv Detail & Related papers (2021-11-05T02:29:32Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatial-temporal kernels at dynamic scales to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we mine interactions among only a few selected foreground objects via a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been applied to video segmentation and deliver compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Improved Actor Relation Graph based Group Activity Recognition [0.0]
The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc.
This study proposes a video understanding method focused mainly on group activity recognition, learning pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in the video.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmarks, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
- Motion-supervised Co-Part Segmentation [88.40393225577088]
We propose a self-supervised deep learning method for co-part segmentation.
Our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.
arXiv Detail & Related papers (2020-04-07T09:56:45Z)
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)