BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
- URL: http://arxiv.org/abs/2205.13542v1
- Date: Thu, 26 May 2022 17:59:35 GMT
- Title: BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
- Authors: Zhijian Liu, Haotian Tang, Alexander Amini, Xinyu Yang, Huizi Mao, Daniela Rus, Song Han
- Abstract summary: We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
- Score: 116.6111047218081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-sensor fusion is essential for an accurate and reliable autonomous
driving system. Recent approaches are based on point-level fusion: augmenting
the LiDAR point cloud with camera features. However, the camera-to-LiDAR
projection throws away the semantic density of camera features, hindering the
effectiveness of such methods, especially for semantic-oriented tasks (such as
3D scene segmentation). In this paper, we break this deeply-rooted convention
with BEVFusion, an efficient and generic multi-task multi-sensor fusion
framework. It unifies multi-modal features in the shared bird's-eye view (BEV)
representation space, which nicely preserves both geometric and semantic
information. To achieve this, we diagnose and lift key efficiency bottlenecks
in the view transformation with optimized BEV pooling, reducing latency by more
than 40x. BEVFusion is fundamentally task-agnostic and seamlessly supports
different 3D perception tasks with almost no architectural changes. It
establishes the new state of the art on nuScenes, achieving 1.3% higher mAP and
NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with
1.9x lower computation cost.
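
The abstract names two technical ingredients: unifying camera and LiDAR features on one shared BEV grid, and an optimized BEV pooling for the camera-to-BEV view transformation. The sketch below is a minimal PyTorch-style illustration of both under assumed shapes; the names bev_pool and BEVFuser and the plain scatter-add reduction are hypothetical stand-ins, not the authors' released implementation (which precomputes the frustum-point-to-cell association per camera rig and aggregates with a specialized interval-reduction kernel).

```python
# Minimal sketch, NOT the paper's implementation: assumed shapes and
# hypothetical names (bev_pool, BEVFuser).
import torch
import torch.nn as nn

def bev_pool(frustum_feats: torch.Tensor, bev_idx: torch.Tensor,
             num_cells: int) -> torch.Tensor:
    """Sum-pool lifted camera features into flat BEV cells.

    frustum_feats: (N, C) features of N lifted camera frustum points.
    bev_idx:       (N,) flat BEV cell index per point; fixed for a given
                   camera rig, so it can be precomputed and cached, which
                   is the gist of the optimized pooling.
    """
    bev = frustum_feats.new_zeros(num_cells, frustum_feats.shape[1])
    bev.index_add_(0, bev_idx, frustum_feats)  # scatter-add per cell
    return bev

class BEVFuser(nn.Module):
    """Fuse camera-BEV and LiDAR-BEV maps by concatenation + convolution."""
    def __init__(self, c_cam: int, c_lidar: int, c_out: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_cam + c_lidar, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor):
        # Both inputs live on the same (B, C, H, W) BEV grid, so the
        # fusion keeps camera semantics and LiDAR geometry alike.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))

# Toy usage on a 128x128 BEV grid.
H = W = 128
feats = torch.randn(10_000, 80)           # lifted camera features (N, C)
idx = torch.randint(0, H * W, (10_000,))  # precomputed cell indices (N,)
cam_bev = bev_pool(feats, idx, H * W).T.reshape(1, 80, H, W)
lidar_bev = torch.randn(1, 256, H, W)     # from a LiDAR backbone
fused = BEVFuser(80, 256, 256)(cam_bev, lidar_bev)
print(fused.shape)                        # torch.Size([1, 256, 128, 128])
```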
Related papers
- CoBEV: Elevating Roadside 3D Object Detection with Depth and Height Complementarity [35.3050904302819]
We develop Complementary-BEV, a novel end-to-end monocular 3D object detection framework.
We conduct extensive experiments on the public roadside camera-based 3D detection benchmarks DAIR-V2X-I and Rope3D.
For the first time, the vehicle AP of a camera-based model reaches 80% on DAIR-V2X-I in the easy mode.
arXiv Detail & Related papers (2023-10-04T13:38:53Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [51.37470133438836]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To marry the semantics inherent in images with the geometric structure of LiDAR point clouds, UniM$^2$AE is proposed.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- Multi-Camera Calibration Free BEV Representation for 3D Object Detection [8.085831393926561]
We present a completely Multi-Camera Calibration Free Transformer (CFT) for robust Bird's Eye View (BEV) representation.
CFT mines potential 3D information in BEV via our designed position-aware enhancement (PA).
CFT achieves 49.7% NDS on the nuScenes detection task leaderboard and is the first work to remove camera parameters.
arXiv Detail & Related papers (2022-10-31T12:18:08Z)
- Center Feature Fusion: Selective Multi-Sensor Fusion of Center-based Objects [26.59231069298659]
We propose a novel approach for building robust 3D object detection systems for autonomous vehicles.
We leverage center-based detection networks in both the camera and LiDAR streams to identify relevant object locations.
On the nuScenes dataset, we outperform the LiDAR-only baseline by 4.9% mAP while fusing up to 100x fewer features than other fusion methods.
arXiv Detail & Related papers (2022-09-26T17:51:18Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches explore the semantic density of camera features by lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$^2$BEV is a unified framework that jointly performs 3D object detection and BEV map segmentation.
It infers both tasks with a single model, improving efficiency; a minimal sketch of this shared-head pattern follows the list.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
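
Several entries above (BEVFusion itself, UniTR, M^2BEV) converge on the same multi-task pattern: one fused BEV feature map feeds a lightweight head per task, so switching tasks needs almost no architectural change. Below is a minimal sketch of that pattern; the head designs, class counts, and 7-DoF box parameterization are illustrative assumptions, not any of the papers' actual heads.

```python
# Minimal sketch of the shared-BEV multi-task pattern; head designs,
# class counts, and the 7-DoF box parameterization are assumptions.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Center-style detection: per-cell class heatmap + box regression."""
    def __init__(self, c_in: int, num_classes: int, box_dims: int = 7):
        super().__init__()
        self.heatmap = nn.Conv2d(c_in, num_classes, kernel_size=1)
        self.boxes = nn.Conv2d(c_in, box_dims, kernel_size=1)  # x,y,z,w,l,h,yaw

    def forward(self, bev: torch.Tensor):
        return self.heatmap(bev), self.boxes(bev)

class SegmentationHead(nn.Module):
    """BEV map segmentation: per-cell logits over map classes."""
    def __init__(self, c_in: int, num_map_classes: int):
        super().__init__()
        self.classify = nn.Conv2d(c_in, num_map_classes, kernel_size=1)

    def forward(self, bev: torch.Tensor):
        return self.classify(bev)

# One fused BEV map, two task heads, no backbone changes.
bev = torch.randn(1, 256, 128, 128)
heatmap, boxes = DetectionHead(256, num_classes=10)(bev)
masks = SegmentationHead(256, num_map_classes=6)(bev)
print(heatmap.shape, boxes.shape, masks.shape)
# (1, 10, 128, 128) (1, 7, 128, 128) (1, 6, 128, 128)
```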