DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera
Based Activity Recognition
- URL: http://arxiv.org/abs/2212.03384v1
- Date: Wed, 7 Dec 2022 00:33:40 GMT
- Title: DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera
Based Activity Recognition
- Authors: Santosh Kumar Yadav, Achleshwar Luthra, Esha Pahwa, Kamlesh Tiwari,
Heena Rathore, Hari Mohan Pandey, Peter Corcoran
- Abstract summary: Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years.
We propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
- Score: 2.705905918316948
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human activity recognition (HAR) using drone-mounted cameras has attracted
considerable interest from the computer vision research community in recent
years. A robust and efficient HAR system has a pivotal role in fields like
video surveillance, crowd behavior analysis, sports analysis, and
human-computer interaction. What makes it challenging are the complex poses, the variety of viewpoints, and the environments in which the actions take place. To address such complexities, in this paper we propose a novel Sparse Weighted Temporal Attention (SWTA) module that uses sparsely sampled video frames to obtain global weighted temporal attention. The proposed SWTA comprises two parts. First, a temporal segment network sparsely samples a given set of frames. Second, weighted temporal attention fuses attention maps derived from optical flow with the raw RGB frames. This is followed by a basenet, a convolutional neural network (CNN) module with fully connected layers that perform the activity recognition. The SWTA network can be used as a plug-in module with existing deep CNN architectures, enabling them to learn temporal information without requiring a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets, surpassing the previous state-of-the-art performances by margins of 25.26%, 18.56%, and 2.94%, respectively.
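To make the described pipeline concrete, below is a minimal PyTorch sketch of the architecture as summarized in the abstract: TSN-style sparse segment sampling, a weighted fusion of an optical-flow-derived attention map with the raw RGB frames, and a CNN basenet with a fully connected head. The module names, the flow-magnitude attention, the learnable fusion weight, and the ResNet-18 backbone are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the SWTA idea (assumptions noted inline; not the official code).
import torch
import torch.nn as nn
import torchvision


def sample_segments(video: torch.Tensor, num_segments: int = 8) -> torch.Tensor:
    """Sparsely sample one frame per temporal segment (TSN-style).

    video: (T, C, H, W) -> (num_segments, C, H, W)
    """
    t = video.shape[0]
    # Evenly spaced frame indices across the clip.
    idx = torch.linspace(0, t - 1, steps=num_segments).round().long()
    return video[idx]


class WeightedTemporalAttention(nn.Module):
    """Fuse an optical-flow-derived attention map with the raw RGB frames."""

    def __init__(self, alpha: float = 0.5):
        super().__init__()
        # Learnable fusion weight (assumption; the paper describes a weighted fusion).
        self.alpha = nn.Parameter(torch.tensor(alpha))

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (N, 3, H, W); flow: (N, 2, H, W) holding the (u, v) flow field.
        magnitude = flow.norm(dim=1, keepdim=True)   # (N, 1, H, W) motion strength
        attention = torch.sigmoid(magnitude)         # squash to (0, 1)
        # Weighted fusion of attention-modulated and raw RGB frames.
        return self.alpha * (rgb * attention) + (1 - self.alpha) * rgb


class SWTAClassifier(nn.Module):
    """SWTA plug-in followed by a CNN basenet and a fully connected head."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.attention = WeightedTemporalAttention()
        # Backbone choice is an assumption; any deep CNN could serve as the basenet.
        self.basenet = torchvision.models.resnet18(weights=None)
        self.basenet.fc = nn.Linear(self.basenet.fc.in_features, num_classes)

    def forward(self, frames: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
        # frames, flows: (num_segments, C, H, W); returns one logit vector per video.
        fused = self.attention(frames, flows)
        logits = self.basenet(fused)      # (num_segments, num_classes)
        return logits.mean(dim=0)         # average the per-segment predictions
```

Averaging the per-segment logits is one plausible consensus step borrowed from temporal segment networks; the actual weighting of segments in SWTA may differ.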
Related papers
- Temporal-Spatial Processing of Event Camera Data via Delay-Loop Reservoir Neural Network [0.11309478649967238]
We study a conjecture motivated by our previous study of video processing with a delay-loop reservoir neural network.
In this paper, we exploit this finding to guide the design of a delay-loop reservoir neural network for event camera classification.
arXiv Detail & Related papers (2024-02-12T16:24:13Z) - SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity
Recognition [2.7677069267434873]
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
arXiv Detail & Related papers (2022-11-10T12:45:43Z) - FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves the state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z) - MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z) - Detection, Tracking, and Counting Meets Drones in Crowds: A Benchmark [97.07865343576361]
We construct a benchmark with a new drone-captured largescale dataset, named as DroneCrowd.
We annotate 20,800 person trajectories with 4.8 million head annotations and several video-level attributes.
We design the Space-Time Neighbor-Aware Network (STNNet) as a strong baseline to solve object detection, tracking and counting jointly in dense crowds.
arXiv Detail & Related papers (2021-05-06T04:46:14Z) - ACDnet: An action detection network for real-time edge computing based
on flow-guided feature approximation and memory aggregation [8.013823319651395]
ACDnet is a compact action detection network targeting real-time edge computing.
It exploits the temporal coherence between successive video frames to approximate CNN features rather than naively extracting them.
It can robustly achieve detection well above real-time speed (75 FPS).
arXiv Detail & Related papers (2021-02-26T14:06:31Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatial and temporal information.
We show that the proposed method achieves superior performance to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - Fast Motion Understanding with Spatiotemporal Neural Networks and
Dynamic Vision Sensors [99.94079901071163]
This paper presents a Dynamic Vision Sensor (DVS) based system for reasoning about high speed motion.
We consider the case of a robot at rest reacting to a small, fast approaching object at speeds higher than 15m/s.
We highlight the results of our system on a toy dart moving at 23.4 m/s, with a 24.73° error in $\theta$, an 18.4 mm average discretized radius prediction error, and a 25.03% median time-to-collision prediction error.
arXiv Detail & Related papers (2020-11-18T17:55:07Z) - PIDNet: An Efficient Network for Dynamic Pedestrian Intrusion Detection [22.316826418265666]
Vision-based dynamic pedestrian intrusion detection (PID), judging from a moving camera whether pedestrians intrude into an area-of-interest (AoI), is an important task in mobile surveillance.
We propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem.
PIDNet is mainly designed by considering two factors: accurately segmenting the dynamically changing AoIs from a video frame captured by the moving camera and quickly detecting pedestrians from the generated AoI-contained areas.
arXiv Detail & Related papers (2020-09-01T09:34:43Z) - A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion
Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the most challenging research fields in computer vision.
Sequences recorded under ego-motion have become especially relevant.
The proposed method copes with this by estimating the ego-motion, or camera motion.
arXiv Detail & Related papers (2020-08-26T14:44:45Z)