DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection
- URL: http://arxiv.org/abs/2203.08195v1
- Date: Tue, 15 Mar 2022 18:46:06 GMT
- Title: DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection
- Authors: Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam,
Daiyi Peng, Junyang Shen, Bo Wu, Yifeng Lu, Denny Zhou, Quoc V. Le, Alan
Yuille, Mingxing Tan
- Abstract summary: Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
- Score: 83.18142309597984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lidars and cameras are critical sensors that provide complementary
information for 3D detection in autonomous driving. While prevalent multi-modal
methods simply decorate raw lidar point clouds with camera features and feed
them directly to existing 3D detection models, our study shows that fusing
camera features with deep lidar features instead of raw points can lead to
better performance. However, as those features are often augmented and
aggregated, a key challenge in fusion is how to effectively align the
transformed features from two modalities. In this paper, we propose two novel
techniques: InverseAug, which inverts geometry-related augmentations, e.g.,
rotation, to enable accurate geometric alignment between lidar points and image
pixels, and LearnableAlign, which leverages cross-attention to dynamically
capture the correlations between image and lidar features during fusion. Based
on InverseAug and LearnableAlign, we develop a family of generic multi-modal 3D
detection models named DeepFusion, which is more accurate than previous
methods. For example, DeepFusion improves PointPillars, CenterPoint, and 3D-MAN
baselines on Pedestrian detection by 6.7, 8.9, and 6.2 LEVEL_2 APH,
respectively. Notably, our models achieve state-of-the-art performance on Waymo
Open Dataset, and show strong model robustness against input corruptions and
out-of-distribution data. Code will be publicly available at
https://github.com/tensorflow/lingvo/tree/master/lingvo/.
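To make the InverseAug idea concrete: the parameters of each geometric augmentation are recorded at training time, and their inverses are applied to the augmented lidar coordinates before projecting into the image, so the camera correspondence is computed in the original frame. Below is a minimal numpy sketch; rotate_z, project_to_image, and the matrix P are illustrative stand-ins, not the paper's actual API:
```python
import numpy as np

def rotate_z(points, angle_rad):
    """Rotate Nx3 points around the z (up) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def project_to_image(points, P):
    """Pinhole projection of Nx3 points with a 3x4 matrix P to Nx2 pixels."""
    homo = np.hstack([points, np.ones((len(points), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

# Training-time augmentation: rotate the point cloud and remember the angle.
rng = np.random.default_rng(0)
points = rng.uniform(1.0, 10.0, size=(8, 3))  # stand-in lidar points
angle = np.deg2rad(15.0)
aug_points = rotate_z(points, angle)

# InverseAug-style alignment: undo the augmentation before projecting, so
# deep lidar features can be paired with pixels in the original frame.
P = np.array([[700.0, 0.0, 320.0, 0.0],
              [0.0, 700.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])          # illustrative camera matrix
restored = rotate_z(aug_points, -angle)       # inverse of the rotation
pixels = project_to_image(restored, P)
assert np.allclose(restored, points)          # geometry recovered exactly
print(pixels.shape)  # (8, 2)
```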
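LearnableAlign, per the abstract, uses cross-attention between lidar and camera features. A plausible single-head toy version treats the deep lidar feature as the query and its corresponding camera features as keys and values; all dimensions, weight names, and the concatenation-based fusion below are assumptions for illustration:
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def learnable_align(lidar_feat, cam_feats, Wq, Wk, Wv):
    """Single-head cross-attention: one lidar cell attends over its
    corresponding camera features and returns a fused vector.

    lidar_feat: (d_l,)   deep lidar feature for one voxel/pillar
    cam_feats:  (n, d_c) camera features aligned to that cell
    """
    q = lidar_feat @ Wq                        # (d,)  query
    k = cam_feats @ Wk                         # (n, d) keys
    v = cam_feats @ Wv                         # (n, d) values
    attn = softmax(q @ k.T / np.sqrt(q.size))  # (n,) attention weights
    attended = attn @ v                        # (d,)  camera summary
    return np.concatenate([lidar_feat, attended])  # fuse by concatenation

rng = np.random.default_rng(0)
d_l, d_c, d, n = 64, 32, 16, 5
fused = learnable_align(
    rng.normal(size=d_l), rng.normal(size=(n, d_c)),
    rng.normal(size=(d_l, d)), rng.normal(size=(d_c, d)),
    rng.normal(size=(d_c, d)))
print(fused.shape)  # (80,) = lidar (64) + attended camera (16)
```
The attention weights let each lidar cell emphasize the camera pixels most correlated with it, which is the dynamic correlation the abstract describes.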
Related papers
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities either into Bird's Eye View (BEV) or into Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be performed effectively by combining their RoI features (see the box-matching sketch after this list).
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion [30.803450612746403]
We propose PathFusion to enable semantically coherent LiDAR-camera deep feature fusion.
PathFusion introduces a path consistency loss at multiple stages within the network, encouraging the 2D backbone and its fusion path to produce features that remain aligned with the 3D backbone (see the consistency-loss sketch after this list).
We observe a consistent improvement of over 1.6% in mAP on the nuScenes test split, both with and without test-time data augmentation.
arXiv Detail & Related papers (2022-12-12T20:58:54Z)
- Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection [11.575945934519442]
LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving.
Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds.
We propose a general multi-modal fusion framework, Multi-Sem Fusion (MSF), to fuse semantic information from both 2D image and 3D point cloud scene-parsing results.
arXiv Detail & Related papers (2022-12-10T10:54:41Z)
- 3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection [13.068266058374775]
We propose a novel camera-LiDAR fusion architecture called 3D Dual-Fusion.
The proposed method fuses the features of the camera-view and 3D voxel-view domain and models their interactions through deformable attention.
The results of an experimental evaluation show that the proposed camera-LiDAR fusion architecture achieved competitive performance on the KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-11-24T11:00:50Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches explore the semantic density of camera features by lifting points in 2D camera images into 3D space for fusion (see the depth-lifting sketch after this list).
We propose a novel framework that focuses on multi-scale progressive interaction between multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and Stereo Data Fusion [62.24001258298076]
VPFNet is a new architecture that cleverly aligns and aggregates the point cloud and image data at 'virtual' points.
Our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021.
arXiv Detail & Related papers (2021-11-29T08:51:20Z)
- RoIFusion: 3D Object Detection from LiDAR and Vision [7.878027048763662]
We propose a novel fusion algorithm that projects a set of 3D Regions of Interest (RoIs) from the point clouds onto the 2D RoIs of the corresponding images (see the RoI-projection sketch after this list).
Our approach achieves state-of-the-art performance on the challenging KITTI 3D object detection benchmark.
arXiv Detail & Related papers (2020-09-09T20:23:27Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
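Returning to the entries above: FBMNet's box matching can be pictured as assigning 2D proposals to 3D proposals and fusing the matched RoI features. The paper learns the assignments; the hard Hungarian matching and random cost matrix below are stand-ins that only illustrate the match-then-fuse structure:
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_and_fuse(cost, roi3d, roi2d):
    """Assign each 3D proposal to a 2D proposal via the cost matrix,
    then fuse the matched RoI features by concatenation.

    cost:  (m, n) matching cost between m 3D and n 2D proposals
    roi3d: (m, d3) RoI features from the lidar branch
    roi2d: (n, d2) RoI features from the camera branch
    """
    rows, cols = linear_sum_assignment(cost)  # hard one-to-one matching
    return np.concatenate([roi3d[rows], roi2d[cols]], axis=1)

rng = np.random.default_rng(1)
m, n, d3, d2 = 4, 6, 128, 256
fused = match_and_fuse(rng.random((m, n)),
                       rng.normal(size=(m, d3)),
                       rng.normal(size=(n, d2)))
print(fused.shape)  # (4, 384): one fused feature per matched pair
```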
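PathFusion's path consistency loss, as summarized above, penalizes disagreement between the 2D fusion path and the 3D backbone at multiple stages. A minimal mean-squared-error version, with the projection weights and feature shapes invented for illustration:
```python
import numpy as np

def path_consistency_loss(feat_2d_path, feat_3d_path, W_proj):
    """Project 2D-path features into the 3D feature space and penalize
    the L2 gap, encouraging the two paths to stay semantically aligned."""
    projected = feat_2d_path @ W_proj            # (n, d3)
    return np.mean((projected - feat_3d_path) ** 2)

rng = np.random.default_rng(2)
n, d2, d3 = 100, 64, 128
loss = path_consistency_loss(rng.normal(size=(n, d2)),
                             rng.normal(size=(n, d3)),
                             rng.normal(size=(d2, d3)) * 0.1)
print(f"consistency loss: {loss:.3f}")
# In training, a term like this would be added at multiple network stages.
```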
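The point 'lifting' mentioned in the MSMDFusion entry is, at its core, inverse pinhole projection: a pixel plus a depth estimate (a depth seed) recovers a 3D point. A sketch with an assumed intrinsics matrix K:
```python
import numpy as np

def lift_pixels(pixels, depths, K):
    """Unproject Nx2 pixels with per-pixel depths into Nx3 camera-frame
    points: x = depth * K^-1 [u, v, 1]^T."""
    homo = np.hstack([pixels, np.ones((len(pixels), 1))])  # (n, 3)
    rays = homo @ np.linalg.inv(K).T                       # (n, 3)
    return rays * depths[:, None]

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])     # illustrative camera intrinsics
pixels = np.array([[320.0, 240.0], [100.0, 50.0]])
depths = np.array([5.0, 12.5])
points_3d = lift_pixels(pixels, depths, K)
print(points_3d)  # the principal-point pixel lifts to (0, 0, 5)
```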
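Finally, RoIFusion pairs each 3D RoI with a 2D RoI on the image. A common recipe, assumed here rather than taken from the paper, is projecting the eight corners of the 3D box and taking their bounding rectangle:
```python
import numpy as np

def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box (center (3,), size (3,))."""
    offsets = np.array([[x, y, z] for x in (-0.5, 0.5)
                        for y in (-0.5, 0.5) for z in (-0.5, 0.5)])
    return center + offsets * size

def project_roi(center, size, P):
    """Project a 3D RoI to its enclosing 2D RoI (u_min, v_min, u_max, v_max)."""
    corners = box_corners(center, size)                 # (8, 3)
    homo = np.hstack([corners, np.ones((8, 1))])
    uvw = homo @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    return np.concatenate([uv.min(axis=0), uv.max(axis=0)])

P = np.array([[700.0, 0.0, 320.0, 0.0],
              [0.0, 700.0, 240.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])    # illustrative projection matrix
roi_2d = project_roi(np.array([2.0, 0.5, 10.0]), np.array([4.0, 1.8, 1.6]), P)
print(roi_2d)  # 2D box used to pool image RoI features for fusion
```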
This list is automatically generated from the titles and abstracts of the papers on this site.