BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via
Cross-Modality Guidance and Temporal Aggregation
- URL: http://arxiv.org/abs/2303.17099v1
- Date: Thu, 30 Mar 2023 02:18:07 GMT
- Title: BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via
Cross-Modality Guidance and Temporal Aggregation
- Authors: Hongxiang Cai, Zeyuan Zhang, Zhenyu Zhou, Ziyin Li, Wenbo Ding, Jiuhua
Zhao
- Abstract summary: We propose a dual-branch framework that generates LiDAR and camera BEV features and then performs adaptive modality fusion.
A LiDAR-Guided View Transformer (LGVT) is designed to effectively obtain the camera representation in BEV space.
Our framework dubbed BEVFusion4D achieves state-of-the-art results in 3D object detection.
- Score: 14.606324706328106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating LiDAR and Camera information into Bird's-Eye-View (BEV) has
become an essential topic for 3D object detection in autonomous driving.
Existing methods mostly adopt an independent dual-branch framework to generate
LiDAR and camera BEV, then perform an adaptive modality fusion. Since point
clouds provide more accurate localization and geometry information, they could
serve as a reliable spatial prior for acquiring relevant semantic information
from the images. Therefore, we design a LiDAR-Guided View Transformer (LGVT) to
effectively obtain the camera representation in BEV space and thus benefit the
whole dual-branch fusion system. LGVT takes the camera BEV as the primitive
semantic query and repeatedly leverages the spatial cues of the LiDAR BEV to
extract image features across multiple camera views. Moreover, we extend our
framework into the temporal domain with our proposed Temporal Deformable
Alignment (TDA) module, which aims to aggregate BEV features from multiple
historical frames. Including these two modules, our framework dubbed
BEVFusion4D achieves state-of-the-art results in 3D object detection, with
72.0% mAP and 73.5% NDS on the nuScenes validation set and 73.3% mAP and 74.7%
NDS on the nuScenes test set.
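
No code accompanies this listing; the following is a minimal PyTorch sketch of the LiDAR-guided query idea behind LGVT, under the assumption that the camera BEV cells act as queries, the LiDAR BEV at the same cells is injected as a spatial prior, and the queries then cross-attend to flattened multi-view image features. Plain multi-head attention stands in for the deformable, multi-layer attention of the actual module, and all names (LidarGuidedBEVLayer, cam_bev, img_feats) are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class LidarGuidedBEVLayer(nn.Module):
    """One LiDAR-guided refinement step for the camera BEV (illustrative sketch)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.lidar_prior = nn.Conv2d(dim, dim, kernel_size=1)  # embeds the LiDAR BEV prior
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, cam_bev, lidar_bev, img_feats):
        # cam_bev, lidar_bev: (B, C, H, W) BEV grids; img_feats: (B, N_view, C, h, w)
        B, C, H, W = cam_bev.shape
        # Inject the LiDAR spatial prior into the camera BEV queries.
        query = (cam_bev + self.lidar_prior(lidar_bev)).flatten(2).transpose(1, 2)  # (B, H*W, C)
        # Flatten all camera views into a single key/value sequence.
        kv = img_feats.flatten(3).permute(0, 1, 3, 2).reshape(B, -1, C)              # (B, N*h*w, C)
        attended, _ = self.cross_attn(query, kv, kv)
        query = self.norm1(query + attended)
        query = self.norm2(query + self.ffn(query))
        return query.transpose(1, 2).reshape(B, C, H, W)  # refined camera BEV

# Toy usage: a 32x32 BEV grid attending to six down-sampled camera views.
layer = LidarGuidedBEVLayer(dim=256)
refined = layer(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32),
                torch.randn(1, 6, 256, 16, 44))
```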
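Similarly, a hedged sketch of the temporal-aggregation idea behind TDA: a past BEV feature map is warped toward the current frame with predicted per-cell offsets and then fused. The paper describes TDA as deformable alignment over multiple historical frames; this sketch handles a single past frame with bilinear warping, and the names (TemporalBEVAlign, offset_head) are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalBEVAlign(nn.Module):
    """Warp a historical BEV feature map toward the current frame (illustrative sketch)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.offset_head = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, 2, 3, padding=1),  # (dx, dy) per BEV cell, in normalized coords
        )
        self.fuse = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, cur_bev, past_bev):
        # cur_bev, past_bev: (B, C, H, W)
        B, C, H, W = cur_bev.shape
        offsets = self.offset_head(torch.cat([cur_bev, past_bev], dim=1))  # (B, 2, H, W)
        # Build an identity sampling grid in [-1, 1] and shift it by the predicted offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=cur_bev.device),
            torch.linspace(-1, 1, W, device=cur_bev.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)   # (B, H, W, 2), (x, y) order
        grid = base + offsets.permute(0, 2, 3, 1)
        aligned = F.grid_sample(past_bev, grid, align_corners=True)  # warped past BEV
        return self.fuse(torch.cat([cur_bev, aligned], dim=1))       # fused current BEV

# Toy usage: fuse the previous frame's BEV into the current one.
tda = TemporalBEVAlign(dim=256)
fused = tda(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```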
Related papers
- SimpleBEV: Improved LiDAR-Camera Fusion Architecture for 3D Object Detection [15.551625571158056]
We propose a LiDAR-camera fusion framework, named SimpleBEV, for accurate 3D object detection.
Our method achieves 77.6% NDS accuracy on the nuScenes dataset, showcasing superior performance in the 3D object detection track.
arXiv Detail & Related papers (2024-11-08T02:51:39Z) - VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection [80.62052650370416]
Monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics.
In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
arXiv Detail & Related papers (2024-04-15T03:12:12Z) - TiG-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry
Learning [7.6887888234987125]
We propose a learning scheme that transfers Target Inner-Geometry from the LiDAR modality into camera-based BEV detectors.
TiG-BEV can effectively boost BEVDepth by +2.3% NDS and +2.4% mAP, along with BEVDet by +9.1% NDS and +10.3% mAP on nuScenes val set.
arXiv Detail & Related papers (2022-12-28T17:53:43Z) - BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder in learning feature representations.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z) - SemanticBEVFusion: Rethink LiDAR-Camera Fusion in Unified Bird's-Eye
View Representation for 3D Object Detection [14.706717531900708]
LiDAR and camera are two essential sensors for 3D object detection in autonomous driving.
Recent methods focus on point-level fusion, which paints the LiDAR point cloud with camera features in the perspective view (a minimal sketch of this painting step appears after this list).
We present SemanticBEVFusion to deeply fuse camera features with LiDAR features in a unified BEV representation.
arXiv Detail & Related papers (2022-12-09T05:48:58Z) - MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth
Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z) - Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object
Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the robustness of state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z) - Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
arXiv Detail & Related papers (2022-05-27T05:42:16Z) - BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z) - BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera
Images via Spatiotemporal Transformers [39.253627257740085]
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems.
We present a new framework termed BEVFormer, which learns unified BEV representations with transformers to support multiple autonomous driving perception tasks.
We show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions.
arXiv Detail & Related papers (2022-03-31T17:59:01Z) - BEVDet: High-performance Multi-camera 3D Object Detection in
Bird-Eye-View [15.560366079077449]
We contribute the BEVDet paradigm for pushing the performance boundary in the multi-camera 3D object detection task.
BEVDet is developed by following the principle of detecting 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed.
The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance.
arXiv Detail & Related papers (2021-12-22T10:48:06Z)
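
For context on the point-level fusion mentioned in the SemanticBEVFusion entry above, a minimal sketch of the painting step is given below: each LiDAR point is projected into the image with the camera projection matrix and decorated with the bilinearly sampled camera feature (or segmentation score) at that pixel. The function name paint_points and the single-camera setup are assumptions for illustration, not any specific paper's implementation.

```python
import torch
import torch.nn.functional as F

def paint_points(points, feat_map, proj_mat):
    """Decorate LiDAR points with image features via perspective projection (sketch).

    points:   (N, 3) LiDAR xyz expressed in the frame expected by proj_mat
    feat_map: (C, H, W) camera feature map or per-pixel segmentation scores
    proj_mat: (3, 4) projection matrix mapping homogeneous 3D points to pixels
    returns:  (M, 3 + C) points falling inside the image, concatenated with features
    """
    N = points.shape[0]
    homog = torch.cat([points, points.new_ones(N, 1)], dim=1)  # (N, 4) homogeneous coords
    uvw = homog @ proj_mat.T                                    # (N, 3)
    depth = uvw[:, 2]
    uv = uvw[:, :2] / depth.clamp(min=1e-6).unsqueeze(1)        # pixel coordinates (u, v)
    C, H, W = feat_map.shape
    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] <= W - 1) \
                        & (uv[:, 1] >= 0) & (uv[:, 1] <= H - 1)
    uv, pts = uv[valid], points[valid]
    # Normalize to [-1, 1] and bilinearly sample the feature map at each projected point.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1, uv[:, 1] / (H - 1) * 2 - 1], dim=1)
    sampled = F.grid_sample(feat_map[None], grid[None, :, None, :], align_corners=True)
    return torch.cat([pts, sampled[0, :, :, 0].T], dim=1)       # (M, 3 + C) painted points
```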