Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For
Autonomous Driving
- URL: http://arxiv.org/abs/2105.12713v1
- Date: Wed, 26 May 2021 17:50:36 GMT
- Title: Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For
Autonomous Driving
- Authors: Kinjal Dasgupta, Arindam Das, Sudip Das, Ujjwal Bhattacharya and
Senthil Yogamani
- Abstract summary: This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images.
Its novel deep network architecture is capable of exploiting multimodal input efficiently.
Results on three benchmark datasets (KAIST, CVC-14, and UTokyo) improve on the respective state-of-the-art performance.
- Score: 1.2599533416395765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian detection is the most critical module of an autonomous driving
system. Although a camera is commonly used for this purpose, its image quality
degrades severely in low-light nighttime driving scenarios. On the other hand,
the quality of a thermal camera image remains unaffected in similar conditions.
This paper proposes an end-to-end multimodal fusion model for pedestrian
detection using RGB and thermal images. Its novel spatio-contextual deep
network architecture is capable of exploiting the multimodal input efficiently.
It consists of two distinct deformable ResNeXt-50 encoders for feature
extraction from the two modalities. Fusion of these two encoded features takes
place inside a multimodal feature embedding module (MuFEm) consisting of
several groups, each pairing a Graph Attention Network with a feature fusion unit.
The output of the last feature fusion unit of MuFEm is subsequently passed to
two CRFs for their spatial refinement. Further enhancement of the features is
achieved by applying channel-wise attention and extraction of contextual
information with the help of four RNNs traversing in four different directions.
Finally, these feature maps are used by a single-stage decoder to generate the
bounding box of each pedestrian and the score map. We have performed extensive
experiments with the proposed framework on three publicly available multimodal
pedestrian detection benchmark datasets, namely KAIST, CVC-14, and UTokyo. The
results on each of them improved the respective state-of-the-art performance. A
short video giving an overview of this work along with its qualitative results
can be seen at https://youtu.be/FDJdSifuuCs.
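A minimal PyTorch sketch of the pipeline described in the abstract, written under simplifying assumptions: the deformable convolutions in the ResNeXt-50 encoders are dropped, the graph-attention-based MuFEm fusion is replaced by a learned gated fusion, the CRF refinement is omitted, and the four directional RNNs are approximated by two bidirectional GRUs sweeping rows and columns. All class and parameter names are illustrative and are not the authors' code.

```python
# Illustrative sketch only; requires torch and torchvision >= 0.13.
import torch
import torch.nn as nn
from torchvision.models import resnext50_32x4d


class Encoder(nn.Module):
    """ResNeXt-50 trunk up to the last conv stage (stride 32, 2048 channels)."""
    def __init__(self):
        super().__init__()
        m = resnext50_32x4d(weights=None)
        self.trunk = nn.Sequential(*list(m.children())[:-2])  # drop avgpool/fc

    def forward(self, x):
        return self.trunk(x)


class GatedFusion(nn.Module):
    """Stand-in for MuFEm: fuse RGB and thermal features with a learned gate."""
    def __init__(self, c):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * c, c, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_th):
        g = self.gate(torch.cat([f_rgb, f_th], dim=1))
        return g * f_rgb + (1 - g) * f_th


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel-wise attention."""
    def __init__(self, c, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))          # (B, C) channel weights
        return x * w[:, :, None, None]


class FourWayContext(nn.Module):
    """Contextual features from GRUs sweeping the map in four directions."""
    def __init__(self, c, hidden=128):
        super().__init__()
        self.rnn_h = nn.GRU(c, hidden, batch_first=True, bidirectional=True)
        self.rnn_v = nn.GRU(c, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Conv2d(4 * hidden, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # left/right sweeps
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)   # up/down sweeps
        rh, _ = self.rnn_h(rows)
        rv, _ = self.rnn_v(cols)
        rh = rh.reshape(b, h, w, -1).permute(0, 3, 1, 2)
        rv = rv.reshape(b, w, h, -1).permute(0, 3, 2, 1)
        return x + self.proj(torch.cat([rh, rv], dim=1))


class MultimodalPedestrianDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_rgb, self.enc_th = Encoder(), Encoder()
        self.fuse = GatedFusion(2048)
        self.chatt = ChannelAttention(2048)
        self.context = FourWayContext(2048)
        # Single-stage heads: per-location pedestrian score and box offsets.
        self.score_head = nn.Conv2d(2048, 1, 3, padding=1)
        self.box_head = nn.Conv2d(2048, 4, 3, padding=1)

    def forward(self, rgb, thermal):
        f = self.fuse(self.enc_rgb(rgb), self.enc_th(thermal))
        f = self.context(self.chatt(f))
        return self.score_head(f).sigmoid(), self.box_head(f)


if __name__ == "__main__":
    net = MultimodalPedestrianDetector()
    rgb = torch.randn(1, 3, 512, 640)
    thermal = torch.randn(1, 3, 512, 640)   # thermal replicated to 3 channels
    score_map, boxes = net(rgb, thermal)
    print(score_map.shape, boxes.shape)     # (1, 1, 16, 20), (1, 4, 16, 20)
```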
Related papers
- A Generalized Multi-Modal Fusion Detection Framework [7.951044844083936]
LiDAR point clouds have become the most common data source in autonomous driving.
Due to the sparsity of point clouds, accurate and reliable detection cannot be achieved in certain scenarios.
We propose a generic 3D detection framework called MMFusion, using multi-modal features.
arXiv Detail & Related papers (2023-03-13T12:38:07Z)
- CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z)
- HRFuser: A Multi-resolution Sensor Fusion Architecture for 2D Object Detection [0.0]
We propose HRFuser, a modular architecture for multi-modal 2D object detection.
It fuses multiple sensors in a multi-resolution fashion and scales to an arbitrary number of input modalities.
We demonstrate via experiments on nuScenes and the adverse conditions DENSE datasets that our model effectively leverages complementary features from additional modalities.
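A minimal sketch of how a multi-resolution branch can absorb an arbitrary number of extra modalities in the fashion described above; the per-level stems, channel sizes, and additive fusion are assumptions for illustration, not the HRFuser implementation.

```python
import torch
import torch.nn as nn


class MultiResFusion(nn.Module):
    def __init__(self, cam_channels=(64, 128, 256), num_extra_modalities=2):
        super().__init__()
        self.stems = nn.ModuleList()
        for _ in range(num_extra_modalities):
            # one small stem per extra modality and per resolution level
            self.stems.append(nn.ModuleList(
                nn.Sequential(nn.Conv2d(3, c, 3, stride=2 ** (i + 1), padding=1),
                              nn.BatchNorm2d(c), nn.ReLU())
                for i, c in enumerate(cam_channels)))

    def forward(self, cam_feats, extra_images):
        # cam_feats: camera feature maps, one per resolution level
        # extra_images: raw extra-modality images (e.g. lidar range, radar map)
        fused = list(cam_feats)
        for stems, img in zip(self.stems, extra_images):
            for lvl, stem in enumerate(stems):
                fused[lvl] = fused[lvl] + stem(img)
        return fused


cam = [torch.randn(1, c, 256 // 2 ** (i + 1), 256 // 2 ** (i + 1))
       for i, c in enumerate((64, 128, 256))]
extra = [torch.randn(1, 3, 256, 256) for _ in range(2)]
print([f.shape for f in MultiResFusion()(cam, extra)])
```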
arXiv Detail & Related papers (2022-06-30T09:40:05Z)
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position-shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and hampers convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
- Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse in this common space either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, and then unrolls it into a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z)
- DDU-Net: Dual-Decoder-U-Net for Road Extraction Using High-Resolution Remote Sensing Images [19.07341794770722]
An enhanced deep neural network model termed Dual-Decoder-U-Net (DDU-Net) is proposed in this paper.
The proposed model outperforms the state-of-the-art DenseUNet, DeepLabv3+ and D-LinkNet by 6.5%, 3.3%, and 2.1% in the mean Intersection over Union (mIoU) and by 4%, 4.8%, and 3.1% in the F1 score, respectively.
arXiv Detail & Related papers (2022-01-18T05:27:49Z)
- MBDF-Net: Multi-Branch Deep Fusion Network for 3D Object Detection [17.295359521427073]
We propose a Multi-Branch Deep Fusion Network (MBDF-Net) for 3D object detection.
In the first stage, our multi-branch feature extraction network utilizes Adaptive Attention Fusion modules to produce cross-modal fusion features from single-modal semantic features.
In the second stage, we use a region-of-interest (RoI)-pooled fusion module to generate enhanced local features for refinement.
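A minimal sketch of an adaptive attention fusion block in the spirit of the first-stage module described above: channel-wise mixing weights are predicted from both single-modal feature maps and used to blend them. The softmax gating and shapes are assumptions, not the MBDF-Net code.

```python
import torch
import torch.nn as nn


class AdaptiveAttentionFusion(nn.Module):
    def __init__(self, c):
        super().__init__()
        # predict one mixing weight per channel and per modality
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, c, 1), nn.ReLU(),
            nn.Conv2d(c, 2 * c, 1))

    def forward(self, f_img, f_pts):
        # f_img, f_pts: (B, C, H, W) image and (projected) point-cloud features
        w = torch.softmax(self.att(torch.cat([f_img, f_pts], 1))
                          .view(f_img.size(0), 2, -1, 1, 1), dim=1)
        return w[:, 0] * f_img + w[:, 1] * f_pts


fuse = AdaptiveAttentionFusion(128)
out = fuse(torch.randn(2, 128, 32, 32), torch.randn(2, 128, 32, 32))
print(out.shape)  # (2, 128, 32, 32)
```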
arXiv Detail & Related papers (2021-08-29T15:40:15Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters are then used to perform a dynamic convolution on their corresponding input feature maps.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
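A minimal sketch of dynamic, per-sample filter generation followed by a dynamic convolution, as described above; the depthwise 3x3 kernels and the grouped-convolution trick are simplifying assumptions rather than the MFGNet design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicFilterBranch(nn.Module):
    def __init__(self, c, k=3):
        super().__init__()
        self.c, self.k = c, k
        # small network that predicts one depthwise k x k kernel per channel
        self.gen = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                 nn.Linear(c, c * k * k))

    def forward(self, feat):
        b, c, h, w = feat.shape
        kernels = self.gen(feat).view(b * c, 1, self.k, self.k)
        # fold the batch into the channel dimension so each sample is filtered
        # with its own dynamically generated kernels in one grouped conv
        out = F.conv2d(feat.reshape(1, b * c, h, w), kernels,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)


visible, thermal = DynamicFilterBranch(256), DynamicFilterBranch(256)
f_vis, f_th = torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32)
fused = visible(f_vis) + thermal(f_th)   # simple additive fusion for the sketch
```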
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
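A minimal sketch of a residual-based fusion module in the spirit of the two-stream scheme above: one stream keeps its features, and the other modality only contributes an additive correction. Channel counts and the concatenate-then-correct form are assumptions.

```python
import torch
import torch.nn as nn


class ResidualFusion(nn.Module):
    def __init__(self, c_main, c_other):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(c_main + c_other, c_main, 3, padding=1),
            nn.BatchNorm2d(c_main), nn.ReLU(),
            nn.Conv2d(c_main, c_main, 3, padding=1))

    def forward(self, f_main, f_other):
        # the fused result stays close to the main stream; the second modality
        # is injected only as a residual term
        return f_main + self.reduce(torch.cat([f_main, f_other], dim=1))


fuse = ResidualFusion(c_main=64, c_other=64)
out = fuse(torch.randn(1, 64, 64, 512), torch.randn(1, 64, 64, 512))
print(out.shape)  # (1, 64, 64, 512)
```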
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
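A minimal sketch of how center-and-scale predictions can be decoded into pedestrian boxes without anchors, in the spirit of the representation described above; the score threshold, stride, and fixed pedestrian aspect ratio are assumptions for illustration.

```python
import torch


def decode_center_scale(center_map, height_map, stride=4,
                        score_thr=0.3, aspect_ratio=0.41):
    # center_map: (H, W) sigmoid scores; height_map: (H, W) heights in pixels
    ys, xs = torch.nonzero(center_map > score_thr, as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        h = float(height_map[y, x])
        w = aspect_ratio * h                      # assumed fixed aspect ratio
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      float(center_map[y, x])])
    return boxes   # [x1, y1, x2, y2, score] per detection


dets = decode_center_scale(torch.rand(96, 120), 50 + 100 * torch.rand(96, 120))
print(len(dets))
```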
arXiv Detail & Related papers (2020-08-19T13:13:01Z)