Dual Semantic Fusion Network for Video Object Detection
- URL: http://arxiv.org/abs/2009.07498v1
- Date: Wed, 16 Sep 2020 06:49:17 GMT
- Title: Dual Semantic Fusion Network for Video Object Detection
- Authors: Lijian Lin, Haosheng Chen, Honglun Zhang, Jun Liang, Yu Li, Ying Shan,
Hanzi Wang
- Abstract summary: We propose a dual semantic fusion network (DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance.
The proposed DSFNet can generate more robust features through the multi-granularity fusion and avoid being affected by the instability of external guidance.
- Score: 35.175552056938635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video object detection is a tough task due to the deteriorated quality of
video sequences captured under complex environments. Currently, this area is
dominated by a series of feature enhancement based methods, which distill
beneficial semantic information from multiple frames and generate enhanced
features through fusing the distilled information. However, the distillation
and fusion operations are usually performed at either frame level or instance
level with external guidance using additional information, such as optical flow
and feature memory. In this work, we propose a dual semantic fusion network
(abbreviated as DSFNet) to fully exploit both frame-level and instance-level
semantics in a unified fusion framework without external guidance. Moreover, we
introduce a geometric similarity measure into the fusion process to alleviate
the influence of information distortion caused by noise. As a result, the
proposed DSFNet can generate more robust features through the multi-granularity
fusion and avoid being affected by the instability of external guidance. To
evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet
VID dataset. Notably, the proposed dual semantic fusion network achieves, to
the best of our knowledge, the best performance of 84.1% mAP among the current
state-of-the-art video object detectors with ResNet-101 and 85.4% mAP with
ResNeXt-101 without using any post-processing steps.
Related papers
- Embracing Events and Frames with Hierarchical Feature Refinement Network for Object Detection [17.406051477690134]
Event cameras output sparse and asynchronous events, offering a potential solution to these problems.
We propose a novel hierarchical feature refinement network for event-frame fusion.
Our method exhibits significantly better robustness when introducing 15 different corruption types to the frame images.
arXiv Detail & Related papers (2024-07-17T14:09:46Z)
- Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information from different modalities effectively improves object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms the state-of-the-art methods in mAP by 5.9% on the M3FD dataset and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- An Interactively Reinforced Paradigm for Joint Infrared-Visible Image Fusion and Saliency Object Detection [59.02821429555375]
This research focuses on the discovery and localization of hidden objects in the wild and serves unmanned systems.
Through empirical analysis, infrared and visible image fusion (IVIF) makes hard-to-find objects apparent.
Multimodal salient object detection (SOD) accurately delineates the precise spatial location of objects within the picture.
arXiv Detail & Related papers (2023-05-17T06:48:35Z)
- CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
- PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)
- Perception-aware Multi-sensor Fusion for 3D LiDAR Semantic Segmentation [59.42262859654698]
3D semantic segmentation is important in scene understanding for many applications, such as auto-driving and robotics.
Existing fusion-based methods may not achieve promising performance due to the vast difference between the two modalities.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to exploit perceptual information from two modalities.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- Progressive Multi-scale Fusion Network for RGB-D Salient Object Detection [9.099589602551575]
We discuss the advantages of the so-called progressive multi-scale fusion method and propose a mask-guided feature aggregation module.
The proposed framework can effectively combine the two features of different modalities and alleviate the impact of erroneous depth features.
We further introduce a mask-guided refinement module (MGRM) to complement the high-level semantic features and reduce the irrelevant features from multi-scale fusion.
arXiv Detail & Related papers (2021-06-07T20:02:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.