FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification
- URL: http://arxiv.org/abs/2209.11316v1
- Date: Thu, 22 Sep 2022 21:15:58 GMT
- Title: FuTH-Net: Fusing Temporal Relations and Holistic Features for Aerial
Video Classification
- Authors: Pu Jin, Lichao Mou, Yuansheng Hua, Gui-Song Xia, Xiao Xiang Zhu
- Abstract summary: We propose a novel deep neural network, termed FuTH-Net, to model not only holistic features, but also temporal relations for aerial video classification.
Our model is evaluated on two aerial video classification datasets, ERA and Drone-Action, and achieves state-of-the-art results.
- Score: 49.06447472006251
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unmanned aerial vehicles (UAVs) are now widely applied to data acquisition
due to their low cost and fast mobility. With the increasing volume of aerial
videos, the demand for automatically parsing these videos is surging. To
achieve this, current research mainly focuses on extracting a holistic feature
with convolutions along both spatial and temporal dimensions. However, these
methods are limited by small temporal receptive fields and cannot adequately
capture long-term temporal dependencies which are important for describing
complicated dynamics. In this paper, we propose a novel deep neural network,
termed FuTH-Net, to model not only holistic features, but also temporal
relations for aerial video classification. Furthermore, the holistic features
are refined by the multi-scale temporal relations in a novel fusion module for
yielding more discriminative video representations. More specifically, FuTH-Net
employs a two-pathway architecture: (1) a holistic representation pathway to
learn a general feature of both frame appearances and short-term temporal
variations and (2) a temporal relation pathway to capture multi-scale temporal
relations across arbitrary frames, providing long-term temporal dependencies.
Afterwards, a novel fusion module is proposed to spatiotemporally integrate the
two features learned from the two pathways. Our model is evaluated on two
aerial video classification datasets, ERA and Drone-Action, and achieves
state-of-the-art results. This demonstrates its effectiveness and good
generalization capacity across different recognition tasks (event
classification and human action recognition). To facilitate further research,
we release the code at https://gitlab.lrz.de/ai4eo/reasoning/futh-net.
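The two-pathway design described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical PyTorch illustration only: the module names, layer sizes, frame-sampling scheme, and the gating-style fusion are assumptions made for exposition, not the authors' implementation (their code is at the GitLab link above).

```python
# Minimal sketch of a two-pathway video classifier: a holistic (3D-conv) pathway,
# a multi-scale temporal-relation pathway, and a fusion head. All details below
# are illustrative assumptions, not FuTH-Net's actual architecture.
import torch
import torch.nn as nn


class HolisticPathway(nn.Module):
    """Stand-in for a 3D-CNN capturing frame appearance and short-term dynamics."""

    def __init__(self, in_channels=3, feat_dim=256):
        super().__init__()
        self.conv3d = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatiotemporal pooling
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, clip):               # clip: (B, C, T, H, W)
        x = self.conv3d(clip).flatten(1)   # (B, 64)
        return self.fc(x)                  # (B, feat_dim)


class TemporalRelationPathway(nn.Module):
    """Assumed multi-scale relation module: sample frames at several temporal
    scales and summarize each group with an MLP to model long-range dependencies."""

    def __init__(self, frame_dim=64, feat_dim=256, scales=(2, 4, 8)):
        super().__init__()
        self.scales = scales
        self.relation_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(s * frame_dim, feat_dim), nn.ReLU(inplace=True))
            for s in scales
        )

    def forward(self, frame_feats):        # frame_feats: (B, T, frame_dim)
        B, T, _ = frame_feats.shape
        outputs = []
        for s, mlp in zip(self.scales, self.relation_mlps):
            idx = torch.linspace(0, T - 1, s).long()    # s frames spread over the clip
            group = frame_feats[:, idx].reshape(B, -1)  # (B, s * frame_dim)
            outputs.append(mlp(group))
        return torch.stack(outputs, dim=0).mean(0)      # (B, feat_dim)


class FusionClassifier(nn.Module):
    """Assumed fusion: temporal relations gate/refine the holistic feature."""

    def __init__(self, feat_dim=256, num_classes=25):   # 25 = ERA's event classes
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, holistic, relations):
        refined = holistic * self.gate(relations)        # relation-guided refinement
        return self.classifier(refined)


if __name__ == "__main__":
    B, C, T, H, W = 2, 3, 16, 112, 112
    clip = torch.randn(B, C, T, H, W)
    frame_feats = torch.randn(B, T, 64)   # per-frame features, e.g. from a 2D backbone

    logits = FusionClassifier()(HolisticPathway()(clip),
                                TemporalRelationPathway()(frame_feats))
    print(logits.shape)                   # torch.Size([2, 25])
```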
Related papers
- ASF-Net: Robust Video Deraining via Temporal Alignment and Online Adaptive Learning [47.10392889695035]
We propose a new computational paradigm, Alignment-Shift-Fusion Network (ASF-Net), which incorporates a temporal shift module.
We construct a LArge-scale RAiny video dataset (LARA) to support the development of this field.
Our proposed approach exhibits superior performance on three benchmarks and compelling visual quality in real-world scenarios.
arXiv Detail & Related papers (2023-09-02T14:50:13Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Revisiting the Spatial and Temporal Modeling for Few-shot Action
Recognition [16.287968292213563]
We propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner.
We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2023-01-19T08:34:04Z) - SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity
Recognition [2.7677069267434873]
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
arXiv Detail & Related papers (2022-11-10T12:45:43Z) - Exploring Motion and Appearance Information for Temporal Sentence
Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Temporal Memory Relation Network for Workflow Recognition from Surgical
Video [53.20825496640025]
We propose a novel end-to-end temporal memory relation network (TMNet) for relating long-range and multi-scale temporal patterns.
We have extensively validated our approach on two benchmark surgical video datasets.
arXiv Detail & Related papers (2021-03-30T13:20:26Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce Coarse-Fine Networks, a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - GTA: Global Temporal Attention for Video Action Understanding [51.476605514802806]
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner.
Tests on 2D and 3D networks demonstrate that our approach consistently enhances temporal modeling and provides state-of-the-art performance on three video action recognition datasets.
arXiv Detail & Related papers (2020-12-15T18:58:21Z) - Exploring Rich and Efficient Spatial Temporal Interactions for Real Time
Video Salient Object Detection [87.32774157186412]
Mainstream methods formulate video saliency mainly from two independent venues, i.e., the spatial and temporal branches.
In this paper, we propose a spatiotemporal network to achieve such improvement in a fully interactive fashion.
Our method is easy to implement yet effective, achieving high-quality video saliency detection at real-time speed (50 FPS).
arXiv Detail & Related papers (2020-08-07T03:24:04Z)