Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models
- URL: http://arxiv.org/abs/2601.00391v1
- Date: Thu, 01 Jan 2026 17:00:04 GMT
- Title: Real-Time Human Detection for Aerial Captured Video Sequences via Deep Models
- Authors: Nouar AlDahoul, Aznul Qalid Md Sabri, Ali Mohammed Mansoor
- Abstract summary: We propose automatic feature learning methods, which combine optical flow and three different deep models. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. Experimental results demonstrated that the proposed methods are successful for the human detection task.
- Score: 1.4656201740804355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human detection in videos plays an important role in various real-life applications. Most traditional approaches depend on handcrafted features, which are problem-dependent and optimal only for specific tasks. Moreover, they are highly susceptible to dynamic events such as illumination changes, camera jitter, and variations in object size. Feature learning approaches, on the other hand, are cheaper and easier to apply because highly abstract and discriminative features can be produced automatically without the need for expert knowledge. In this paper, we utilize automatic feature learning methods, which combine optical flow and three different deep models (i.e., a supervised convolutional neural network (S-CNN), a pretrained CNN feature extractor, and a hierarchical extreme learning machine (H-ELM)), for human detection in videos captured using a nonstatic camera on an aerial platform at varying altitudes. The models are trained and tested on the publicly available and highly challenging UCF-ARG aerial dataset. The models are compared in terms of training accuracy, testing accuracy, and learning speed. The performance evaluation considers five human actions (digging, waving, throwing, walking, and running). Experimental results demonstrate that the proposed methods are successful for the human detection task. The pretrained CNN produces an average accuracy of 98.09%. S-CNN produces an average accuracy of 95.6% with softmax and 91.7% with Support Vector Machines (SVM). H-ELM produces an average accuracy of 95.9%. On a normal Central Processing Unit (CPU), H-ELM trains in 445 seconds; S-CNN takes 770 seconds to train on a high-performance Graphical Processing Unit (GPU).
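The CPU-vs-GPU training-time gap reported in the abstract follows from how an extreme learning machine is trained: the hidden-layer weights are drawn at random and frozen, and only the output weights are solved in closed form by least squares, so no backpropagation is needed. Below is a minimal single-layer sketch in NumPy; it is a simplification of the paper's hierarchical H-ELM, and the toy two-cluster features and labels are invented for illustration.

```python
import numpy as np

def train_elm(X, y, n_hidden=64, seed=0):
    """Train a single-hidden-layer extreme learning machine.

    Hidden weights are random and fixed; only the output weights
    (beta) are solved in closed form via the pseudo-inverse, which
    is why ELM training is fast compared with backprop-based CNNs.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # frozen random weights
    b = rng.standard_normal(n_hidden)                # frozen random biases
    H = np.tanh(X @ W + b)                           # random feature map
    Y = np.eye(y.max() + 1)[y]                       # one-hot targets
    beta = np.linalg.pinv(H) @ Y                     # least-squares solve
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# Toy demo: two separable Gaussian clusters stand in for the
# flow-segmented human/non-human crop features used in the paper.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 8)),
               rng.normal(+2.0, 0.5, size=(50, 8))])
y = np.array([0] * 50 + [1] * 50)

W, b, beta = train_elm(X, y)
acc = (predict_elm(X, W, b, beta) == y).mean()
print(f"training accuracy: {acc:.2f}")
```

In the paper's pipeline, the inputs would be motion-segmented crops obtained via optical flow rather than synthetic clusters; the single closed-form solve is what keeps training time on a CPU in the hundreds of seconds.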
Related papers
- Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z) - SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning [78.44705665291741]
We present a comprehensive evaluation of modern video self-supervised models. We focus on generalization across four key downstream factors: domain shift, sample efficiency, action granularity, and task diversity. Our analysis shows that, despite architectural advances, transformer-based models remain sensitive to downstream conditions.
arXiv Detail & Related papers (2025-04-08T06:00:28Z) - Comparison of gait phase detection using traditional machine learning and deep learning techniques [3.11526333124308]
This study proposes several Machine Learning (ML) based models trained on lower-limb EMG data for human gait phase detection.
The results show up to 75% average accuracy for traditional ML models and 79% for the Deep Learning (DL) model.
arXiv Detail & Related papers (2024-03-07T10:05:09Z) - LoRA-like Calibration for Multimodal Deception Detection using ATSFace Data [1.550120821358415]
We introduce an attention-aware neural network addressing challenges inherent in video data and deception dynamics.
We employ a multimodal fusion strategy that enhances accuracy; our approach yields a 92% accuracy rate on a real-life trial dataset.
arXiv Detail & Related papers (2023-09-04T06:22:25Z) - Evaluating raw waveforms with deep learning frameworks for speech emotion recognition [0.0]
We present a model that feeds raw audio files directly into deep neural networks without any feature extraction stage.
We use six different data sets: EMO-DB, RAVDESS, TESS, CREMA, SAVEE, and TESS+RAVDESS.
The proposed model achieves 90.34% accuracy on EMO-DB with the CNN model, 90.42% on RAVDESS, 99.48% on TESS with the LSTM model, 69.72% on CREMA with the CNN model, and 85.76% on SAVEE with the CNN model.
arXiv Detail & Related papers (2023-07-06T07:27:59Z) - Human activity recognition using deep learning approaches and single frame CNN and convolutional LSTM [0.0]
We explore two deep learning-based approaches, namely single frame Convolutional Neural Networks (CNNs) and convolutional Long Short-Term Memory (LSTM) networks, to recognise human actions from videos.
The two models were trained and evaluated on a benchmark action recognition dataset, UCF50, and another dataset that was created for the experimentation.
Though both models achieve good accuracy, the single frame CNN model outperforms the convolutional LSTM model, reaching 99.8% accuracy on the UCF50 dataset.
arXiv Detail & Related papers (2023-04-18T01:33:29Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z) - Where is my hand? Deep hand segmentation for visual self-recognition in
humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z) - Ensembles of Deep Neural Networks for Action Recognition in Still Images [3.7900158137749336]
We propose a transfer learning technique to tackle the lack of massive labeled action recognition datasets.
We also use eight different pre-trained CNNs in our framework and investigate their performance on Stanford 40 dataset.
The best setting of our method is able to achieve 93.17% accuracy on the Stanford 40 dataset.
arXiv Detail & Related papers (2020-03-22T13:44:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.