STF: Spatio-Temporal Fusion Module for Improving Video Object Detection
- URL: http://arxiv.org/abs/2402.10752v1
- Date: Fri, 16 Feb 2024 15:19:39 GMT
- Title: STF: Spatio-Temporal Fusion Module for Improving Video Object Detection
- Authors: Noreen Anwar, Guillaume-Alexandre Bilodeau and Wassim Bouachir
- Abstract summary: Consecutive frames in a video contain redundancy, but they may also contain complementary information for the detection task.
We propose a spatio-temporal fusion framework (STF) to leverage this complementary information.
The proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors.
- Score: 7.213855322671065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consecutive frames in a video contain redundancy, but they may also contain
relevant complementary information for the detection task. The objective of our
work is to leverage this complementary information to improve detection.
Therefore, we propose a spatio-temporal fusion framework (STF). We first
introduce multi-frame and single-frame attention modules that allow a neural
network to share feature maps between nearby frames to obtain more robust
object representations. Second, we introduce a dual-frame fusion module that
merges feature maps in a learnable manner to improve them. Our evaluation is
conducted on three different benchmarks including video sequences of moving
road users. The performed experiments demonstrate that the proposed
spatio-temporal fusion module leads to improved detection performance compared
to baseline object detectors. Code is available at
https://github.com/noreenanwar/STF-module
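To make the dual-frame fusion idea concrete, here is a minimal PyTorch sketch that merges the feature maps of two nearby frames with a learned per-pixel gate. The module name, gating design, and shapes are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class DualFrameFusion(nn.Module):
    """Illustrative learnable fusion of two nearby frames' feature maps.

    A sketch of the general idea (learned per-pixel weighting of the
    current and support frame), not the STF authors' implementation.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel gate from the concatenated feature maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_t: torch.Tensor, feat_nearby: torch.Tensor) -> torch.Tensor:
        # feat_t, feat_nearby: (N, C, H, W) feature maps from two frames.
        g = self.gate(torch.cat([feat_t, feat_nearby], dim=1))
        # Convex per-pixel combination of the two frames' features.
        return g * feat_t + (1.0 - g) * feat_nearby

# Usage: fuse backbone features of frame t with those of frame t-1.
fusion = DualFrameFusion(channels=256)
f_t = torch.randn(1, 256, 32, 32)
f_prev = torch.randn(1, 256, 32, 32)
fused = fusion(f_t, f_prev)  # (1, 256, 32, 32)
```

A convex combination keeps the fused map on the same scale as the inputs, which makes this kind of module easy to drop behind an existing detector backbone.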
Related papers
- LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation [87.71768494466959]
LaneTCA bridges individual video frames and explores how to effectively aggregate the temporal context.
We develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term temporal context.
The two modules are meticulously designed based on the transformer architecture.
arXiv Detail & Related papers (2024-08-25T14:46:29Z)
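To make the short-term (adjacent) attention idea above concrete, here is a minimal transformer-style sketch in PyTorch: the current frame's tokens query the previous frame's tokens through cross-attention. The class name and layer choices are assumptions for illustration, not LaneTCA's published code.

```python
import torch
import torch.nn as nn

class AdjacentAttention(nn.Module):
    """Illustrative short-term (adjacent-frame) cross-attention.

    Queries come from the current frame, keys/values from the previous
    frame, so current features are refined with one step of temporal
    context. A sketch of the idea only, not LaneTCA's implementation.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        # cur, prev: (N, L, D) token sequences of two consecutive frames.
        ctx, _ = self.attn(query=cur, key=prev, value=prev)
        return self.norm(cur + ctx)  # residual + norm, transformer-style

tokens_t = torch.randn(1, 196, 256)
tokens_prev = torch.randn(1, 196, 256)
refined = AdjacentAttention(256)(tokens_t, tokens_prev)  # (1, 196, 256)
```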
- Feature Aggregation and Propagation Network for Camouflaged Object Detection [42.33180748293329]
Camouflaged object detection (COD) aims to detect/segment camouflaged objects embedded in the environment.
Several COD methods have been developed, but they still suffer from unsatisfactory performance due to intrinsic similarities between foreground objects and background surroundings.
We propose a novel Feature Aggregation and Propagation Network (FAP-Net) for camouflaged object detection.
arXiv Detail & Related papers (2022-12-02T05:54:28Z)
- SWTF: Sparse Weighted Temporal Fusion for Drone-Based Activity Recognition [2.7677069267434873]
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community.
We propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
arXiv Detail & Related papers (2022-11-10T12:45:43Z)
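The sparse weighted fusion idea above can be sketched in a few lines: score each sparsely sampled frame, softmax the scores into weights, and take the weighted sum. Everything below (names, shapes, the linear scorer) is an illustrative assumption rather than the SWTF authors' module.

```python
import torch
import torch.nn as nn

class SparseWeightedFusion(nn.Module):
    """Illustrative weighted fusion of sparsely sampled frame features.

    Each sampled frame's feature vector gets a learned scalar score; a
    softmax over frames turns the scores into fusion weights. A sketch
    of the general idea, not the SWTF paper's module.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-frame importance score

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (N, K, D) features of K sparsely sampled frames.
        w = torch.softmax(self.score(frame_feats), dim=1)  # (N, K, 1)
        return (w * frame_feats).sum(dim=1)                # (N, D)

# Sample every 8th frame from a 64-frame clip, then fuse.
clip = torch.randn(2, 64, 512)
sampled = clip[:, ::8]                           # (2, 8, 512)
video_feat = SparseWeightedFusion(512)(sampled)  # (2, 512)
```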
- A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design separate networks for these similar tasks, and the resulting models are difficult to transfer from one task to another.
We introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object Segmentation).
arXiv Detail & Related papers (2022-03-09T13:35:19Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
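Below is a rough PyTorch sketch of combining low-level and high-level features with mutual gating, in the spirit of the co-attention formulation mentioned above. The projection layers and merge convolution are assumptions chosen for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttentionFuse(nn.Module):
    """Illustrative co-attention combining low- and high-level features.

    The high-level map is upsampled, each stream is gated by the other,
    and the gated maps are merged. A sketch of the co-attention idea
    only, not the paper's exact formulation.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.proj_low = nn.Conv2d(channels, channels, 1)
        self.proj_high = nn.Conv2d(channels, channels, 1)
        self.merge = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low: (N, C, H, W) early-layer map; high: (N, C, h, w) deep map.
        high = F.interpolate(high, size=low.shape[-2:], mode="bilinear",
                             align_corners=False)
        # Mutual gating: each stream is modulated by the other.
        a = torch.sigmoid(self.proj_high(high))
        b = torch.sigmoid(self.proj_low(low))
        return self.merge(torch.cat([low * a, high * b], dim=1))

low = torch.randn(1, 64, 88, 88)
high = torch.randn(1, 64, 22, 22)
fused = CoAttentionFuse(64)(low, high)  # (1, 64, 88, 88)
```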
- FFAVOD: Feature Fusion Architecture for Video Object Detection [11.365829102707014]
We propose FFAVOD, standing for feature fusion architecture for video object detection.
We first introduce a novel video object detection architecture that allows a network to share feature maps between nearby frames.
We show that using the proposed architecture and the fusion module can improve the performance of three base object detectors on two object detection benchmarks containing sequences of moving road users.
arXiv Detail & Related papers (2021-09-15T13:53:21Z)
- TF-Blender: Temporal Feature Blender for Video Object Detection [6.369234802164117]
Video object detection is a challenging task because isolated video frames may suffer from appearance deterioration.
We propose TF-Blender, which includes three modules; its temporal relation module models the relations between the current frame and its neighboring frames to preserve spatial information.
Owing to its simplicity, TF-Blender can be effortlessly plugged into any detection network to improve detection performance.
arXiv Detail & Related papers (2021-08-12T16:01:34Z)
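A minimal sketch of the temporal-relation idea above: weight each neighboring frame by its similarity to the current frame before blending. The cosine-similarity scorer is an assumption chosen for brevity; TF-Blender's actual relation module is learned.

```python
import torch
import torch.nn.functional as F

def temporal_relation_weights(cur: torch.Tensor,
                              neighbors: torch.Tensor) -> torch.Tensor:
    """Illustrative temporal-relation weighting (a sketch, not TF-Blender).

    Scores each neighboring frame by cosine similarity to the current
    frame and normalizes the scores into blending weights.

    cur:       (N, D)    pooled feature of the current frame.
    neighbors: (N, T, D) pooled features of T neighboring frames.
    """
    sim = F.cosine_similarity(cur.unsqueeze(1), neighbors, dim=-1)  # (N, T)
    return torch.softmax(sim, dim=1)

cur = torch.randn(2, 256)
nbrs = torch.randn(2, 5, 256)
w = temporal_relation_weights(cur, nbrs)       # (2, 5), rows sum to 1
blended = (w.unsqueeze(-1) * nbrs).sum(dim=1)  # (2, 256)
```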
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously, before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
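The full-duplex notion above, i.e. transmitting and receiving cross-modal features in the same step, can be sketched as a symmetric gated exchange between an appearance stream and a motion stream. The gating convolutions below are illustrative assumptions, not FSNet's architecture.

```python
import torch
import torch.nn as nn

class BidirectionalExchange(nn.Module):
    """Illustrative simultaneous cross-modal feature exchange.

    Appearance and motion streams each send and receive a gated copy of
    the other stream's features in the same step, loosely mirroring the
    full-duplex idea. A sketch only, not the FSNet implementation.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.to_motion = nn.Conv2d(channels, channels, 1)  # appearance -> motion
        self.to_appear = nn.Conv2d(channels, channels, 1)  # motion -> appearance

    def forward(self, appear: torch.Tensor, motion: torch.Tensor):
        # Both directions are computed from the *input* features, so the
        # exchange is simultaneous rather than sequential.
        sent_a = torch.sigmoid(self.to_motion(appear)) * appear
        sent_m = torch.sigmoid(self.to_appear(motion)) * motion
        return appear + sent_m, motion + sent_a

rgb = torch.randn(1, 128, 44, 44)
flow = torch.randn(1, 128, 44, 44)
rgb2, flow2 = BidirectionalExchange(128)(rgb, flow)
```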
- Online Multiple Object Tracking with Cross-Task Synergy [120.70085565030628]
We propose a novel unified model with synergy between position prediction and embedding association.
The two tasks are linked by temporal-aware target attention and distractor attention, as well as an identity-aware memory aggregation model.
arXiv Detail & Related papers (2021-04-01T10:19:40Z)
- F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation [61.74261802856947]
We propose a novel Focus on Foreground Network (F2Net), which delves into intra- and inter-frame details of the foreground objects.
Our proposed network consists of three main parts: a Siamese Module, a Center Guiding Appearance Diffusion Module, and a Dynamic Information Fusion Module.
Experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show that our proposed F2Net achieves state-of-the-art performance with significant improvements.
arXiv Detail & Related papers (2020-12-04T11:30:50Z)
- Multi-View Adaptive Fusion Network for 3D Object Detection [14.506796247331584]
3D object detection based on LiDAR-camera fusion is an emerging research theme for autonomous driving.
We propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection.
We design an end-to-end learnable network named MVAF-Net to integrate these two components.
arXiv Detail & Related papers (2020-11-02T00:06:01Z)
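To illustrate the multi-view fusion step, here is a minimal sketch that adaptively re-weights per-point features gathered from the bird's-eye, range, and camera views before summing them. The gate design and shapes are assumptions in the spirit of attentive pointwise fusion, not MVAF-Net's actual network.

```python
import torch
import torch.nn as nn

class AdaptiveViewFusion(nn.Module):
    """Illustrative adaptive fusion of per-point multi-view features.

    Features gathered for the same points from the bird's-eye, range,
    and camera views are re-weighted channel-wise and summed. A sketch
    in the spirit of attentive pointwise fusion, not MVAF-Net itself.
    """

    def __init__(self, dim: int, views: int = 3):
        super().__init__()
        # One channel-wise gate per view, predicted from all views jointly.
        self.gates = nn.Linear(views * dim, views * dim)
        self.views, self.dim = views, dim

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (N, P, V, D) -- P points, V views, D channels.
        n, p, v, d = view_feats.shape
        g = torch.sigmoid(self.gates(view_feats.reshape(n, p, v * d)))
        g = g.reshape(n, p, v, d)
        return (g * view_feats).sum(dim=2)  # (N, P, D)

bev = torch.randn(1, 1024, 64)
rng = torch.randn(1, 1024, 64)
cam = torch.randn(1, 1024, 64)
fused = AdaptiveViewFusion(64)(torch.stack([bev, rng, cam], dim=2))
```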