GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection
- URL: http://arxiv.org/abs/2403.11848v3
- Date: Tue, 2 Jul 2024 12:16:31 GMT
- Title: GraphBEV: Towards Robust BEV Feature Alignment for Multi-Modal 3D Object Detection
- Authors: Ziying Song, Lei Yang, Shaoqing Xu, Lin Liu, Dongyang Xu, Caiyan Jia, Feiyang Jia, Li Wang,
- Abstract summary: We propose a robust fusion framework called Graph BEV to integrate LiDAR and camera BEV features.
Our framework outperforms BEV Fusion by 8.3% under conditions with misalignment noise.
- Score: 18.21607858133675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating LiDAR and camera information into Bird's-Eye-View (BEV) representation has emerged as a crucial aspect of 3D object detection in autonomous driving. However, existing methods are susceptible to the inaccurate calibration relationship between LiDAR and the camera sensor. Such inaccuracies result in errors in depth estimation for the camera branch, ultimately causing misalignment between LiDAR and camera BEV features. In this work, we propose a robust fusion framework called Graph BEV. Addressing errors caused by inaccurate point cloud projection, we introduce a Local Align module that employs neighbor-aware depth features via Graph matching. Additionally, we propose a Global Align module to rectify the misalignment between LiDAR and camera BEV features. Our Graph BEV framework achieves state-of-the-art performance, with an mAP of 70.1\%, surpassing BEV Fusion by 1.6\% on the nuscenes validation set. Importantly, our Graph BEV outperforms BEV Fusion by 8.3\% under conditions with misalignment noise.
Related papers
- SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection [15.551625571158056]
We propose a LiDAR-camera fusion framework, named SimpleBEV, for accurate 3D object detection.
Our method achieves 77.6% NDS accuracy on the nuScenes dataset, showcasing superior performance in the 3D object detection track.
arXiv Detail & Related papers (2024-11-08T02:51:39Z) - ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection [21.05923528672353]
We propose a novel ContrastAlign approach to enhance the alignment of heterogeneous modalities.
Our method achieves state-of-the-art performance, with an mAP of 70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set.
arXiv Detail & Related papers (2024-05-27T06:43:12Z) - U-BEV: Height-aware Bird's-Eye-View Segmentation and Neural Map-based Relocalization [81.76044207714637]
Relocalization is essential for intelligent vehicles when GPS reception is insufficient or sensor-based localization fails.
Recent advances in Bird's-Eye-View (BEV) segmentation allow for accurate estimation of local scene appearance.
This paper presents U-BEV, a U-Net inspired architecture that extends the current state-of-the-art by allowing the BEV to reason about the scene on multiple height layers before flattening the BEV features.
arXiv Detail & Related papers (2023-10-20T18:57:38Z) - BroadBEV: Collaborative LiDAR-camera Fusion for Broad-sighted Bird's Eye
View Map Construction [31.664613321775516]
We propose a broad BEV fusion (BroadBEV) that addresses the problems with a spatial synchronization approach of cross-modality.
Our strategy aims to enhance camera BEV estimation for a broad-sighted perception while simultaneously improving the completion of LiDAR's sparsity in the entire BEV space.
arXiv Detail & Related papers (2023-09-20T07:55:57Z) - FB-BEV: BEV Representation from Forward-Backward View Transformations [131.11787050205697]
We propose a novel View Transformation Module (VTM) for Bird-Eye-View (BEV) representation.
We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set.
arXiv Detail & Related papers (2023-08-04T10:26:55Z) - MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation [104.12419434114365]
In real-world applications, sensor corruptions and failures lead to inferior performances.
We propose a robust framework, called MetaBEV, to address extreme real-world environments.
We show MetaBEV outperforms prior arts by a large margin on both full and corrupted modalities.
arXiv Detail & Related papers (2023-04-19T16:37:17Z) - VoxelFormer: Bird's-Eye-View Feature Generation based on Dual-view
Attention for Multi-view 3D Object Detection [47.926010021559314]
transformer-based detectors have demonstrated remarkable performance in 2D visual perception tasks.
However, their performance in multi-view 3D object detection remains inferior to the state-of-the-art (SOTA) of convolutional neural network based detectors.
We propose a novel BEV feature generation method, dual-view attention, which generates attention weights from both the BEV and camera view.
arXiv Detail & Related papers (2023-04-03T15:00:36Z) - BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via
Cross-Modality Guidance and Temporal Aggregation [14.606324706328106]
We propose a dual-branch framework to generate LiDAR and camera BEV, then perform an adaptive modality fusion.
A LiDAR-Guided View Transformer (LGVT) is designed to effectively obtain the camera representation in BEV space.
Our framework dubbed BEVFusion4D achieves state-of-the-art results in 3D object detection.
arXiv Detail & Related papers (2023-03-30T02:18:07Z) - BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z) - BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z) - BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers [39.253627257740085]
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems.
We present a new framework termed BEVFormer, which learns unified BEV representations with transformers to support multiple autonomous driving perception tasks.
We show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions.
arXiv Detail & Related papers (2022-03-31T17:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.