Toward Accurate Person-level Action Recognition in Videos of Crowded
Scenes
- URL: http://arxiv.org/abs/2010.08365v1
- Date: Fri, 16 Oct 2020 13:08:50 GMT
- Title: Toward Accurate Person-level Action Recognition in Videos of Crowded
Scenes
- Authors: Li Yuan, Yichen Zhou, Shuning Chang, Ziyuan Huang, Yunpeng Chen,
Xuecheng Nie, Tao Wang, Jiashi Feng, Shuicheng Yan
- Abstract summary: We focus on improving the action recognition by fully-utilizing the information of scenes and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame.
We then apply action recognition models to learn the spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet.
- Score: 131.9067467127761
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting and recognizing human actions in videos of crowded scenes is a
challenging problem due to the complex environment and diverse events. Prior
works typically fall short in two respects: (1) they do not fully utilize the
information of the scenes; (2) they lack training data for crowded and complex
scenes. In this paper, we focus on improving spatio-temporal action recognition
by fully utilizing the information of the scenes and collecting new data. A
top-down strategy is used to overcome these limitations. Specifically, we adopt
a strong human detector to detect the spatial location of each person in each
frame. We then apply action recognition models to learn the spatio-temporal
information from video frames on both the HIE dataset and new data with diverse
scenes from the internet, which improves the generalization ability of our
model. Besides, scene information is extracted by a semantic segmentation model
to assist the process. As a result, our method achieved an average 26.05
wf_mAP, ranking 1st place in the ACM MM Grand Challenge 2020: Human in Events.
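The top-down strategy the abstract describes can be sketched roughly as follows. This is a minimal illustrative pipeline with hypothetical stand-in functions (`detect_persons`, `segment_scene`, `classify_action`), not the authors' actual models or APIs:

```python
# Hedged sketch of a top-down person-level action recognition pipeline:
# detect each person per frame, extract scene context via segmentation,
# then classify each person's action using both cues. All components
# below are hypothetical stand-ins, not the paper's real models.

def detect_persons(frame):
    # Stand-in for a strong human detector: returns person bounding boxes.
    return [(10, 20, 50, 120)]  # (x, y, w, h)

def segment_scene(frame):
    # Stand-in for a semantic segmentation model providing scene context.
    return {"region": "street"}

def classify_action(person_clip, scene_context):
    # Stand-in for a spatio-temporal action recognition model that also
    # consumes scene information to assist its prediction.
    return "walking"

def recognize_actions(frames):
    """Run the top-down pipeline over a short clip of frames."""
    results = []
    for frame in frames:
        scene = segment_scene(frame)
        for box in detect_persons(frame):
            # Carry the person crop symbolically as (frame, box).
            person_clip = (frame, box)
            results.append((box, classify_action(person_clip, scene)))
    return results

print(recognize_actions(["frame0", "frame1"]))
```

In the real system each stand-in would be a trained network, and the per-person crops would be stacked across frames into spatio-temporal tubes before classification.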
Related papers
- Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection [29.900229206335908]
Detection of anomalies in human-related videos is crucial for surveillance applications.
Current methods rely on appearance-based and action-based techniques.
We propose a novel decoupling-based architecture for human-related video anomaly detection (DecoAD)
arXiv Detail & Related papers (2024-09-05T04:13:13Z)
- Reconstructing Close Human Interactions from Multiple Views [38.924950289788804]
This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras.
We introduce a novel system to address these challenges.
Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies.
arXiv Detail & Related papers (2024-01-29T14:08:02Z)
- Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video Grounding [59.599378814835205]
Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
arXiv Detail & Related papers (2023-12-21T07:49:27Z)
- Generalizable Person Search on Open-world User-Generated Video Content [93.72028298712118]
Person search is a challenging task that involves retrieving individuals from a large set of uncropped scene images.
Existing person search applications are mostly trained and deployed in the same-origin scenarios.
We propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios.
arXiv Detail & Related papers (2023-10-16T04:59:50Z)
- Towards Accurate Human Pose Estimation in Videos of Crowded Scenes [134.60638597115872]
We focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data.
For each frame, we propagate the historical poses forward from the previous frames and the future poses backward from the subsequent frames to the current frame, leading to stable and accurate human pose estimation in videos.
In this way, our model achieves the best performance on 7 out of 13 videos and 56.33 average w_AP on the test dataset of the HIE challenge.
arXiv Detail & Related papers (2020-10-16T13:19:11Z)
- Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly related with the scene where the action happens, making the model tend to degrade to a solution where only the scene information is encoded.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays better attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z) - Human in Events: A Large-Scale Benchmark for Human-centric Video
Analysis in Complex Events [106.19047816743988]
We present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve.
It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, as well as one of the largest numbers of long-lasting trajectories.
Based on its diverse annotations, we present two simple baselines for action recognition and pose estimation.
arXiv Detail & Related papers (2020-05-09T18:24:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.