SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity
Recognition
- URL: http://arxiv.org/abs/2211.05531v1
- Date: Thu, 10 Nov 2022 12:45:43 GMT
- Title: SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity
Recognition
- Authors: Santosh Kumar Yadav, Esha Pahwa, Achleshwar Luthra, Kamlesh Tiwari,
Hari Mohan Pandey, Peter Corcoran
- Abstract summary: Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
- Score: 2.7677069267434873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Drone-camera based human activity recognition (HAR) has received significant
attention from the computer vision research community in the past few years. A
robust and efficient HAR system has a pivotal role in fields like video
surveillance, crowd behavior analysis, sports analysis, and human-computer
interaction. What makes it challenging are the complex poses, the variety of
viewpoints, and the environmental settings in which the action takes place. To
address these complexities, in this paper we propose a novel Sparse Weighted
Temporal Fusion (SWTF) module that utilizes sparsely sampled video frames to
obtain a globally weighted temporal fusion of the clip. The proposed SWTF
consists of two components: first, a temporal segment network that sparsely
samples a given set of frames; second, a weighted temporal fusion stage that
fuses feature maps derived from optical flow with the raw RGB frames. This is
followed by the base network, a convolutional neural network with fully
connected layers that performs the activity recognition. SWTF can be used as a
plug-in module with existing deep CNN architectures, allowing them to learn
temporal information without the need for a separate temporal stream. It has
been evaluated on three publicly available benchmark datasets, namely Okutama,
MOD20, and Drone-Action. The proposed model achieves accuracies of 72.76%,
92.56%, and 78.86% on the respective datasets, surpassing the previous
state-of-the-art performance by a significant margin.
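As a rough illustration of the pipeline described in the abstract (sparse segment sampling, weighted fusion of optical-flow features with RGB frames, then a base CNN classifier), here is a minimal PyTorch-style sketch. The TSN-style centre-of-segment sampling, the 1x1 flow projection, the additive fusion rule, and all layer sizes are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an SWTF-style pipeline (illustrative assumptions throughout).
import torch
import torch.nn as nn


def sparse_sample(num_frames: int, num_segments: int) -> torch.Tensor:
    """TSN-style sparse sampling: split the video into equal segments and
    pick one frame index per segment (here the segment centre)."""
    seg_len = num_frames / num_segments
    idx = [int(seg_len * i + seg_len / 2) for i in range(num_segments)]
    return torch.tensor(idx).clamp(0, num_frames - 1)


class WeightedTemporalFusion(nn.Module):
    """Fuses optical-flow feature maps with RGB frames using learned
    per-segment weights, then classifies with a small stand-in base CNN."""

    def __init__(self, num_segments: int, num_classes: int):
        super().__init__()
        # one scalar weight per sampled segment, softmax-normalised at fusion time
        self.seg_weights = nn.Parameter(torch.ones(num_segments))
        self.flow_proj = nn.Conv2d(2, 3, kernel_size=1)  # map 2-ch flow into RGB space
        self.base = nn.Sequential(                        # toy base network
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb:  (B, T, 3, H, W) sparsely sampled frames
        # flow: (B, T, 2, H, W) optical flow for the same frames
        b, t, _, h, w = rgb.shape
        fused = rgb + self.flow_proj(flow.flatten(0, 1)).view(b, t, 3, h, w)
        w_t = torch.softmax(self.seg_weights, dim=0).view(1, t, 1, 1, 1)
        clip = (w_t * fused).sum(dim=1)      # global weighted temporal fusion
        return self.base(clip)               # (B, num_classes) activity logits


if __name__ == "__main__":
    video = torch.randn(2, 120, 3, 112, 112)          # two clips of 120 RGB frames
    idx = sparse_sample(num_frames=120, num_segments=8)
    rgb = video[:, idx]                                # (2, 8, 3, 112, 112)
    flow = torch.randn(2, 8, 2, 112, 112)              # flow for the sampled frames
    model = WeightedTemporalFusion(num_segments=8, num_classes=13)  # 13 classes: example value
    print(model(rgb, flow).shape)                      # torch.Size([2, 13])
```

In practice the base network would be a deep pretrained backbone rather than the toy convolutional stack used here; the sketch only shows where the sparse sampling and the weighted fusion sit relative to it.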
Related papers
- DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera
Based Activity Recognition [2.705905918316948]
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years.
We propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
arXiv Detail & Related papers (2022-12-07T00:33:40Z) - FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification [49.06447472006251]
We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results.
arXiv Detail & Related papers (2022-09-22T21:15:58Z) - Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z) - MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection [37.25262046781015]
Action detection is an essential and challenging task, especially for densely labelled datasets of untrimmed videos.
We propose a novel ConvTransformer network for action detection that efficiently captures both short-term and long-term temporal information.
Our network outperforms the state-of-the-art methods on all three datasets.
arXiv Detail & Related papers (2021-12-07T18:57:37Z) - Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z) - MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are then used to perform dynamic convolution on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z) - DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z) - TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map (a minimal sketch of this idea appears after the list below).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z) - Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z)
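For the TAM entry above, here is a minimal sketch of the general idea of a video-specific temporal kernel, assuming the kernel is predicted by a small fully connected branch from globally pooled features and applied as a depth-wise temporal convolution. The kernel size, layer widths, and single-branch structure are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a TAM-style video-specific temporal kernel (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoSpecificTemporalKernel(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        # small FC branch predicting one temporal kernel per channel per video
        self.to_kernel = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels * kernel_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) per-frame features already pooled over space
        b, c, t = x.shape
        ctx = x.mean(dim=2)                          # (B, C) video-level context
        kernels = self.to_kernel(ctx).view(b * c, 1, self.k)
        kernels = torch.softmax(kernels, dim=-1)     # normalised temporal kernel
        # depth-wise temporal convolution with a different kernel per (video, channel)
        out = F.conv1d(x.reshape(1, b * c, t), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, t)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8)                    # 2 clips, 64 channels, 8 frames
    print(VideoSpecificTemporalKernel(64)(feats).shape)  # torch.Size([2, 64, 8])
```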