A Multi-viewpoint Outdoor Dataset for Human Action Recognition
- URL: http://arxiv.org/abs/2110.04119v1
- Date: Thu, 7 Oct 2021 14:50:43 GMT
- Title: A Multi-viewpoint Outdoor Dataset for Human Action Recognition
- Authors: Asanka G. Perera, Yee Wei Law, Titilayo T. Ogunwa, and Javaan Chahl
- Abstract summary: We present a multi-viewpoint outdoor action recognition dataset collected from YouTube and our own drone.
The dataset consists of 20 dynamic human action classes, 2324 video clips and 503086 frames.
The overall baseline action recognition accuracy is 74.0%.
- Score: 3.522154868524807
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Advancements in deep neural networks have contributed to near-perfect results
for many computer vision problems such as object recognition, face recognition
and pose estimation. However, human action recognition is still far from
human-level performance. Owing to the articulated nature of the human body, it
is challenging to detect an action from multiple viewpoints, particularly from
an aerial viewpoint. This is further compounded by a scarcity of datasets that
cover multiple viewpoints of actions. To fill this gap and enable research in
wider application areas, we present a multi-viewpoint outdoor action
recognition dataset collected from YouTube and our own drone. The dataset
consists of 20 dynamic human action classes, 2324 video clips and 503086
frames. All videos are cropped and resized to 720x720 without distorting the
original aspect ratio of the human subjects in videos. This dataset should be
useful to many research areas including action recognition, surveillance and
situational awareness. We evaluated the dataset with a two-stream CNN
architecture coupled with a recently proposed temporal pooling scheme called
kernelized rank pooling that produces nonlinear feature subspace
representations. The overall baseline action recognition accuracy is 74.0%.
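The abstract states that videos are cropped and resized to 720x720 without distorting the subjects' aspect ratio, but it does not say how the crop is chosen. One way to get that property is to cut a square region first and then scale both axes by the same factor. The sketch below illustrates this under stated assumptions: the function names, the subject-centre inputs `cx` and `cy`, and the clamping policy are hypothetical and are not taken from the paper.

```python
def square_crop_box(frame_w, frame_h, cx, cy, side=None):
    """Return (left, top, side) of a square crop centred near (cx, cy).

    Hypothetical helper: the crop is a square, so a later resize to a
    square target scales width and height by the same factor.
    """
    max_side = min(frame_w, frame_h)
    # Default to the largest square that fits inside the frame.
    side = max_side if side is None else min(side, max_side)
    # Clamp the top-left corner so the square stays inside the frame.
    left = min(max(int(round(cx - side / 2)), 0), frame_w - side)
    top = min(max(int(round(cy - side / 2)), 0), frame_h - side)
    return left, top, side


def uniform_scale(side, target=720):
    """Single scale factor applied to both axes when resizing the crop."""
    return target / side


if __name__ == "__main__":
    # A 1920x1080 frame with the subject at the frame centre (assumed values).
    left, top, side = square_crop_box(1920, 1080, 960, 540)
    print(left, top, side, uniform_scale(side))
```

Because the crop is square and the scale factor is shared by both axes, a subject inside the crop keeps its original proportions after resizing to 720x720.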
Related papers
- Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
(2023-11-10T18:38:14Z)
- DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering [126.00165445599764]
We present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering.
Our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames of data.
We construct a professional multi-view system to capture data, which contains 60 synchronous cameras with up to 4096 x 3000 resolution at 15 fps, and strict camera calibration steps.
(2023-05-25T03:54:41Z)
- Deep Neural Networks in Video Human Action Recognition: A Review [21.00217656391331]
Video behavior recognition is one of the most foundational tasks of computer vision.
Deep neural networks are built for recognizing pixel-level information such as images with RGB, RGB-D, or optical flow formats.
In our article, deep neural networks outperform most other techniques in feature learning and extraction tasks.
(2022-09-15T22:16:52Z)
- Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
(2021-10-05T01:18:15Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
(2021-04-08T20:01:00Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.
(2021-04-02T08:54:04Z)
- UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles [12.210724541266183]
We propose a new benchmark, UAV-Human, for human behavior understanding with UAVs.
Our dataset contains 67,428 multi-modal video sequences and 119 subjects for action recognition.
We propose a fisheye-based action recognition method that mitigates the distortions in fisheye videos via learning transformations guided by flat RGB videos.
(2021-02-05T19:46:49Z)
- Video Action Recognition Using spatio-temporal optical flow video frames [0.0]
There are many problems associated with recognizing human actions in videos.
This paper focuses on spatial and temporal pattern recognition for the classification of videos using Deep Neural Networks.
The final recognition accuracy was about 94%.
(2020-10-16T13:08:50Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of people in each frame.
We then apply action recognition models to learn the temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet.
(2020-10-16T13:08:50Z)