Related papers: BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection

URL: http://arxiv.org/abs/2406.19048v2
Date: Sun, 01 Dec 2024 07:07:02 GMT
Title: BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection
Authors: Yang Song, Lin Wang,
Abstract summary: New trend is to fuse multi-modal inputs, i.e., LiDAR and camera.<n>LiDAR features struggle with detailed semantic information and the camera lacks accurate 3D spatial information.<n>BiCo-Fusion can achieve robust semantic- and spatial-aware 3D object detection.
Score: 10.321117046185321
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D object detection is an important task that has been widely applied in autonomous driving. To perform this task, a new trend is to fuse multi-modal inputs, i.e., LiDAR and camera. Under such a trend, recent methods fuse these two modalities by unifying them in the same 3D space. However, during direct fusion in a unified space, the drawbacks of both modalities (LiDAR features struggle with detailed semantic information and the camera lacks accurate 3D spatial information) are also preserved, diluting semantic and spatial awareness of the final unified representation. To address the issue, this letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to fuse LiDAR and camera features in a bidirectional complementary way to enhance the semantic awareness of the LiDAR and the 3D spatial awareness of the camera. The enhanced features from both modalities are then adaptively fused to build a semantic- and spatial-aware unified representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantic awareness of voxel features from 2D camera features and Image Enhancement Module (IEM) to enhance the 3D spatial awareness of camera features from 3D voxel features. We then introduce Unified Fusion (U-Fusion) to adaptively fuse the enhanced features from the last stage to build a unified representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.

Related papers

Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving [25.28104119280405]
We propose TPCNet, the first outdoor 3D visual grounding model upon the paradigm of prompt-guided point cloud sensor combination. To balance the features of these two sensors required by the prompt, we have designed a multi-fusion paradigm called Two-Stage Heterogeneous Modal Adaptive Fusion. Our experiments have demonstrated that TPCNet achieves the state-of-the-art performance on both the Talk2Radar and Talk2Car datasets.
arXiv Detail & Related papers (2025-03-11T11:48:27Z)
Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities either in Bird's Eye View (BEV) or Perspective View (PV) We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
SemanticBEVFusion: Rethink LiDAR-Camera Fusion in Unified Bird's-Eye View Representation for 3D Object Detection [14.706717531900708]
LiDAR and camera are two essential sensors for 3D object detection in autonomous driving. Recent methods focus on point-level fusion which paints the LiDAR point cloud with camera features in the perspective view. We present SemanticBEVFusion to deeply fuse camera features with LiDAR features in a unified BEV representation.
arXiv Detail & Related papers (2022-12-09T05:48:58Z)
3D Dual-Fusion: Dual-Domain Dual-Query Camera-LiDAR Fusion for 3D Object Detection [13.068266058374775]
We propose a novel camera-LiDAR fusion architecture called 3D Dual-Fusion. The proposed method fuses the features of the camera-view and 3D voxel-view domain and models their interactions through deformable attention. The results of an experimental evaluation show that the proposed camera-LiDAR fusion architecture achieved competitive performance on the KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-11-24T11:00:50Z)
Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving. It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels. Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels.
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion. We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR. The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection. UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [49.689566246504356]
We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions. TransFusion achieves state-of-the-art performance on large-scale datasets. We extend the proposed method to the 3D tracking task and achieve the 1st place in the leaderboard of nuScenes tracking.
arXiv Detail & Related papers (2022-03-22T07:15:13Z)
DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion(CB-Fusion) module. The proposed CB-Fusion module boosts the plentiful semantic information of point features with the image features in a cascade bi-directional interaction fusion manner. The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)
Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation. We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds. We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity.
arXiv Detail & Related papers (2021-07-04T09:28:18Z)
Volumetric Propagation Network: Stereo-LiDAR Fusion for Long-Range Depth Estimation [81.08111209632501]
We propose a geometry-aware stereo-LiDAR fusion network for long-range depth estimation. We exploit sparse and accurate point clouds as a cue for guiding correspondences of stereo images in a unified 3D volume space. Our network achieves state-of-the-art performance on the KITTI and the Virtual- KITTI datasets.
arXiv Detail & Related papers (2021-03-24T03:24:46Z)
Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection. The whole architecture facilitates two-stage fusion. Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection [10.507404260449333]
We propose a new architecture for fusing camera and LiDAR sensors for 3D object detection. The proposed 3D-CVF achieves state-of-the-art performance in the KITTI benchmark.
arXiv Detail & Related papers (2020-04-27T08:34:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.