M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified
Birds-Eye View Representation
- URL: http://arxiv.org/abs/2204.05088v1
- Date: Mon, 11 Apr 2022 13:43:25 GMT
- Title: M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified
Birds-Eye View Representation
- Authors: Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar,
Sanja Fidler, Ping Luo, Jose M. Alvarez
- Abstract summary: M$^2$BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M$^2$BEV infers both tasks with a unified model and improves efficiency.
- Score: 145.6041893646006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose M$^2$BEV, a unified framework that jointly performs
3D object detection and map segmentation in the Birds Eye View (BEV) space with
multi-camera image inputs. Unlike the majority of previous works which
separately process detection and segmentation, M$^2$BEV infers both tasks with
a unified model and improves efficiency. M$^2$BEV efficiently transforms
multi-view 2D image features into the 3D BEV feature in ego-car coordinates.
Such BEV representation is important as it enables different tasks to share a
single encoder. Our framework further contains four important designs that
benefit both accuracy and efficiency: (1) An efficient BEV encoder design that
reduces the spatial dimension of a voxel feature map. (2) A dynamic box
assignment strategy that uses learning-to-match to assign ground-truth 3D boxes
with anchors. (3) A BEV centerness re-weighting that assigns larger weights to
more distant predictions. (4) Large-scale 2D detection
pre-training and auxiliary supervision. We show that these designs
significantly benefit the ill-posed camera-based 3D perception tasks where
depth information is missing. M$^2$BEV is memory efficient, allowing
significantly higher resolution images as input, with faster inference speed.
Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in
both 3D object detection and BEV segmentation, with the best single model
achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.
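Of the four designs listed in the abstract, the BEV centerness re-weighting is concrete enough to sketch. The exact formulation is not given here, so the snippet below is only a hypothetical illustration of the stated idea (the function name `bev_centerness_weight` and the parameters `max_range` and `alpha` are assumptions): the per-location loss weight grows with radial distance from the ego vehicle, so that distant predictions, which are harder under missing depth, contribute more to training.

```python
import numpy as np

def bev_centerness_weight(xs, ys, max_range=51.2, alpha=1.0):
    """Hypothetical BEV centerness re-weighting.

    xs, ys: BEV coordinates (meters) in the ego-car frame.
    max_range: assumed half-extent of the BEV grid in meters.
    alpha: scale of the distance-dependent boost.
    Returns a weight >= 1 that increases with distance from the ego car.
    """
    dist = np.sqrt(xs ** 2 + ys ** 2)        # radial distance from ego
    return 1.0 + alpha * dist / max_range    # weight grows linearly with range

# Weights for a nearby prediction vs. a distant one:
w = bev_centerness_weight(np.array([1.0, 40.0]), np.array([1.0, 30.0]))
```

A linear ramp is just one plausible choice; the key property is monotonic growth with range, which counteracts the sparser, noisier supervision at the far edge of the BEV grid.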
Related papers
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which enables effective learning from various unlabelled target data, is far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- WidthFormer: Toward Efficient Transformer-based BEV View Transformation [21.10523575080856]
WidthFormer is a transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications.
We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
We then develop two modules to compensate for potential information loss due to feature compression.
arXiv Detail & Related papers (2024-01-08T11:50:23Z)
- SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos [20.51396212498941]
SparseBEV is a fully sparse 3D object detector that outperforms its dense counterparts.
On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS.
arXiv Detail & Related papers (2023-08-18T02:11:01Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- TiG-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning [7.6887888234987125]
We propose a learning scheme that distills target inner-geometry from the LiDAR modality into camera-based BEV detectors.
TiG-BEV can effectively boost BEVDepth by +2.3% NDS and +2.4% mAP, along with BEVDet by +9.1% NDS and +10.3% mAP on nuScenes val set.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.