Human in Events: A Large-Scale Benchmark for Human-centric Video
Analysis in Complex Events
- URL: http://arxiv.org/abs/2005.04490v6
- Date: Thu, 13 Jul 2023 13:23:05 GMT
- Title: Human in Events: A Large-Scale Benchmark for Human-centric Video
Analysis in Complex Events
- Authors: Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Rui Qian, Tao Wang, Ning
Xu, Hongkai Xiong, Guo-Jun Qi, Nicu Sebe
- Abstract summary: We present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve.
It contains a record number of poses (>1M), the largest number of action instances (>56k) under complex events, and one of the largest collections of long-duration trajectories.
Based on these diverse annotations, we present two simple baselines for action recognition and pose estimation.
- Score: 106.19047816743988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Along with the development of modern smart cities, human-centric video
analysis has been encountering the challenge of analyzing diverse and complex
events in real scenes. A complex event involves dense crowds, anomalous
individuals, or collective behaviors. However, limited by the scale and
coverage of existing video datasets, few human analysis approaches have
reported their performances on such complex events. To this end, we present a
new large-scale dataset with comprehensive annotations, named Human-in-Events
or HiEve (Human-centric video analysis in complex Events), for the
understanding of human motions, poses, and actions in a variety of realistic
events, especially crowded and complex events. It contains a record number of
poses (>1M), the largest number of action instances (>56k) under complex
events, and one of the largest collections of long-duration trajectories
(with an average trajectory length of >480 frames). Based on these diverse
annotations, we present two simple baselines for action recognition and pose
estimation, respectively. Both leverage cross-label information during
training to enhance feature learning in the corresponding visual tasks.
Experiments show that they boost the performance of existing action
recognition and pose estimation pipelines. More importantly, they demonstrate
that the wide-ranging annotations in HiEve can improve various video tasks.
Furthermore, we conduct extensive experiments to benchmark recent video
analysis approaches together with our baseline methods, demonstrating that
HiEve is a challenging dataset for human-centric video analysis. We expect the
dataset will advance the development of cutting-edge techniques in
human-centric analysis and the understanding of complex events. The dataset is
available at http://humaninevents.org
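The baselines' use of cross-label information is only described at a high level. As a minimal, hypothetical sketch of the general idea, the PyTorch snippet below shares one backbone between a pose head and an action head, so each task's labels regularize the features used by the other; all module names, tensor shapes, and the unweighted loss sum are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of cross-label training (not the paper's architecture):
# a shared backbone feeds both a pose head and an action head, so pose and
# action labels jointly shape the same features.
import torch
import torch.nn as nn

class CrossLabelBaseline(nn.Module):
    def __init__(self, num_joints=14, num_actions=14, feat_dim=256):
        super().__init__()
        # Shared feature extractor over a person crop (3x128x64 in this toy).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose_head = nn.Linear(feat_dim, num_joints * 2)  # (x, y) per joint
        self.action_head = nn.Linear(feat_dim, num_actions)   # action logits

    def forward(self, x):
        feats = self.backbone(x)
        return self.pose_head(feats), self.action_head(feats)

model = CrossLabelBaseline()
crops = torch.randn(8, 3, 128, 64)    # dummy person crops
joints = torch.randn(8, 14 * 2)       # dummy pose targets
actions = torch.randint(0, 14, (8,))  # dummy action labels

pose_pred, action_logits = model(crops)
# Cross-label supervision: both losses backpropagate through the shared backbone.
loss = nn.functional.mse_loss(pose_pred, joints) \
     + nn.functional.cross_entropy(action_logits, actions)
loss.backward()
```

In a real setup the two losses would likely be weighted and the heads far deeper; the sketch only shows how supervision from one label type can flow into the features used by another.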
Related papers
- A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [58.08209212057164]
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges.
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos [43.536874272236986]
We propose a new video visual relation detection task: video human-human interaction detection.
SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports.
We conduct extensive experiments to reveal the key factors for a successful human-human interaction detector.
arXiv Detail & Related papers (2024-04-06T09:13:03Z)
- EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World [44.34800426136217]
We introduce EgoExoLearn, a dataset that emulates the human demonstration following process.
EgoExoLearn contains egocentric and demonstration video data spanning 120 hours.
We present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment.
arXiv Detail & Related papers (2024-03-24T15:00:44Z)
- Human-centric Scene Understanding for 3D Large-scale Scenarios [52.12727427303162]
We present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife.
Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc.
arXiv Detail & Related papers (2023-07-26T08:40:46Z)
- JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection [54.696819174421584]
We introduce JRDB-Act, a multi-modal dataset that reflects a real distribution of human daily life actions in a university campus environment.
JRDB-Act is densely annotated with atomic actions, comprising over 2.8M action labels.
JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene.
arXiv Detail & Related papers (2021-06-16T14:43:46Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to locate each person in every frame.
We then apply action recognition models to learn the temporal information from video frames, training on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- A Simple Baseline for Pose Tracking in Videos of Crowded Scenes [130.84731947842664]
Tracking human poses in crowded and complex environments has not been well addressed.
We use a multi-object tracking method to assign a human ID to each bounding box generated by the detection model (this ID-assignment step is sketched below).
Finally, optical flow is used to exploit the temporal information in the videos and generate the final pose tracking result.
arXiv Detail & Related papers (2020-10-16T13:06:21Z)
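The detect-then-track step of this last baseline can be illustrated with a toy sketch. The greedy IoU matcher below assigns persistent human IDs to per-frame detection boxes; the actual baseline uses a full multi-object tracker plus optical-flow smoothing, and the function names and threshold here are assumptions for illustration only.

```python
# Toy ID assignment for detect-then-track (not the paper's tracker):
# each new detection inherits the ID of the previous-frame box it overlaps most.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def assign_ids(prev_tracks, detections, thresh=0.3, next_id=0):
    """Greedily match detections to previous tracks by IoU; unmatched boxes get new IDs."""
    tracks, used = {}, set()
    for box in detections:
        best_id, best_iou = None, thresh
        for tid, prev_box in prev_tracks.items():
            if tid in used:
                continue
            score = iou(box, prev_box)
            if score > best_iou:
                best_id, best_iou = tid, score
        if best_id is None:  # no overlap above threshold: start a new track
            best_id, next_id = next_id, next_id + 1
        used.add(best_id)
        tracks[best_id] = box
    return tracks, next_id

frame1, nid = assign_ids({}, [(0, 0, 10, 10), (20, 20, 30, 30)])
frame2, _ = assign_ids(frame1, [(1, 1, 11, 11)], next_id=nid)
# The slightly shifted box keeps ID 0 because it overlaps its previous position.
```

A production tracker would add motion models, appearance features, and track termination; the point here is only how per-frame detections become identity-consistent trajectories that a pose tracker can then smooth with optical flow.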