Weakly Aligned Feature Fusion for Multimodal Object Detection
- URL: http://arxiv.org/abs/2204.09848v1
- Date: Thu, 21 Apr 2022 02:35:23 GMT
- Title: Weakly Aligned Feature Fusion for Multimodal Object Detection
- Authors: Lu Zhang, Zhiyong Liu, Xiangyu Zhu, Zhan Song, Xu Yang, Zhen Lei, Hong
Qiao
- Abstract summary: Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
- Score: 52.15436349488198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To achieve accurate and robust object detection in real-world scenarios,
various forms of images are incorporated, such as color, thermal, and depth.
However, multimodal data often suffer from the position shift problem, i.e.,
the image pair is not strictly aligned, so the same object has different
positions in the two modalities. For deep learning methods, this problem
makes it difficult to fuse multimodal features and complicates convolutional
neural network (CNN) training. In this article, we propose a general multimodal
detector named aligned region CNN (AR-CNN) to tackle the position shift
problem. First, a region feature (RF) alignment module with adjacent similarity
constraint is designed to consistently predict the position shift between two
modalities and adaptively align the cross-modal RFs. Second, we propose a novel
region of interest (RoI) jitter strategy to improve the robustness to
unexpected shift patterns. Third, we present a new multimodal feature fusion
method that selects the more reliable feature and suppresses the less useful
one via feature reweighting. In addition, by locating bounding boxes in both
modalities and building their relationships, we provide a novel multimodal
labeling named KAIST-Paired. Extensive experiments on 2-D and 3-D object
detection with RGB-T and RGB-D datasets demonstrate the effectiveness and
robustness of our method.
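The abstract names three components (region feature alignment, RoI jitter, and reweighted feature fusion) but gives no implementation detail. As a minimal illustration only, the PyTorch sketch below shows how the RoI jitter and the feature-reweighting fusion could look; the gating network, the module layout, and the 5% jitter magnitude are assumptions rather than the authors' implementation, and the region feature alignment module is not sketched.

```python
import torch
import torch.nn as nn


class ReweightedFusion(nn.Module):
    """Illustrative feature-reweighting fusion (a sketch, not the AR-CNN code):
    predict one weight per modality for each RoI and blend the two feature maps
    so the more reliable modality dominates."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # (N, 2C, H, W) -> (N, 2C, 1, 1)
            nn.Flatten(),                       # -> (N, 2C)
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2),             # one score per modality
            nn.Softmax(dim=1),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_thermal: torch.Tensor) -> torch.Tensor:
        # feat_*: RoI-aligned features of shape (num_rois, C, H, W)
        weights = self.gate(torch.cat([feat_rgb, feat_thermal], dim=1))
        w_rgb = weights[:, 0].view(-1, 1, 1, 1)
        w_thermal = weights[:, 1].view(-1, 1, 1, 1)
        return w_rgb * feat_rgb + w_thermal * feat_thermal


def jitter_rois(rois: torch.Tensor, max_ratio: float = 0.05) -> torch.Tensor:
    """Illustrative RoI jitter: randomly translate each (x1, y1, x2, y2) box by a
    fraction of its size during training to simulate unexpected position shifts.
    The 5% magnitude is an assumed value, not one taken from the paper."""
    x1, y1, x2, y2 = rois.unbind(dim=1)
    w, h = x2 - x1, y2 - y1
    dx = (torch.rand_like(w) * 2 - 1) * max_ratio * w
    dy = (torch.rand_like(h) * 2 - 1) * max_ratio * h
    return torch.stack([x1 + dx, y1 + dy, x2 + dx, y2 + dy], dim=1)


if __name__ == "__main__":
    fuse = ReweightedFusion(channels=256)
    rgb = torch.randn(8, 256, 7, 7)       # 8 RoIs from the color branch
    thermal = torch.randn(8, 256, 7, 7)   # 8 RoIs from the thermal branch
    print(fuse(rgb, thermal).shape)       # torch.Size([8, 256, 7, 7])

    boxes = torch.tensor([[10.0, 20.0, 110.0, 220.0]])
    print(jitter_rois(boxes))             # slightly shifted copy of the box
```

In a full detector, the jittered RoIs would presumably pass through the cross-modal alignment step before RoI pooling; the two pieces are shown independently here for brevity.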
Related papers
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, the fusion for detection can be effectively performed by combining their RoI features.
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with a hybrid fusion scheme.
Our model outperforms state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec-3D AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
- Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection [6.385624548310884]
We propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem.
Unlike previous multi-modal transformers that directly connect all patches from the two modalities, we explore the cross-modal complementarity hierarchically.
We present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a consistency-complementarity module to disentangle the multi-modal integration path.
arXiv Detail & Related papers (2023-02-16T03:23:23Z)
- HiDAnet: RGB-D Salient Object Detection via Hierarchical Depth Awareness [2.341385717236931]
We propose a novel Hierarchical Depth Awareness network (HiDAnet) for RGB-D saliency detection.
Our motivation comes from the observation that the multi-granularity properties of geometric priors correlate well with the neural network hierarchies.
Our HiDAnet performs favorably over the state-of-the-art methods by large margins.
arXiv Detail & Related papers (2023-01-18T10:00:59Z)
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
- M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection [1.002712867721496]
Methods based on RGB-D often suffer from the incompatibility of multi-modal feature fusion and the insufficiency of multi-scale feature aggregation.
We propose a novel multi-modal and multi-scale refined network (M2RNet).
Three essential components are presented in this network.
arXiv Detail & Related papers (2021-09-16T12:15:40Z)
- Multi-Modal Pedestrian Detection with Large Misalignment Based on Modal-Wise Regression and Multi-Modal IoU [15.59089347915245]
The combined use of multiple modalities enables accurate pedestrian detection under poor lighting conditions.
The vital assumption for this combined use is that there is no misalignment, or only a weak one, between the two modalities.
In this paper, we propose a multi-modal Faster-RCNN that is robust against large misalignment.
arXiv Detail & Related papers (2021-07-23T12:58:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.