BEVDet: High-performance Multi-camera 3D Object Detection in
Bird-Eye-View
- URL: http://arxiv.org/abs/2112.11790v1
- Date: Wed, 22 Dec 2021 10:48:06 GMT
- Title: BEVDet: High-performance Multi-camera 3D Object Detection in
Bird-Eye-View
- Authors: Junjie Huang, Guan Huang, Zheng Zhu, and Dalong Du
- Abstract summary: We contribute the BEVDet paradigm for pushing the performance boundary in the multi-camera 3D object detection task.
BEVDet is developed by following the principle of detecting the 3D objects in Bird-Eye-View (BEV), where route planning can be handily performed.
The proposed paradigm works well in multi-camera 3D object detection and offers a good trade-off between computing budget and performance.
- Score: 15.560366079077449
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autonomous driving perceives the surrounding environment for decision making,
which is one of the most complicated scenarios for visual perception. The great
power of paradigm innovation in solving the 2D object detection task inspires
us to seek an elegant, feasible, and scalable paradigm for pushing the
performance boundary in this area. To this end, we contribute the BEVDet
paradigm in this paper. BEVDet is developed by following the principle of
detecting the 3D objects in Bird-Eye-View (BEV), where route planning can be
handily performed. In this paradigm, four kinds of modules are applied in
succession, each with a different role: an image-view encoder for encoding
features in image view, a view transformer for transforming features from image
view to BEV, a BEV encoder for further encoding features in BEV, and a
task-specific head for predicting the targets in BEV. We merely reuse existing
modules to construct BEVDet and make it feasible for multi-camera 3D object
detection by constructing an exclusive data augmentation strategy. The proposed
paradigm works well in multi-camera 3D object detection and offers a good
trade-off between computing budget and performance. With an image size of
704x256 (1/8 that of its competitors), BEVDet scores 29.4% mAP and 38.4% NDS on
the nuScenes val set, comparable with FCOS3D (i.e., 2008.2 GFLOPs, 1.7 FPS,
29.5% mAP and 37.2% NDS), while requiring merely 12% of the computing budget
(239.4 GFLOPs) and running 4.3 times faster. Scaling the input size up to
1408x512, BEVDet scores 34.9% mAP and 41.7% NDS with just 601.4 GFLOPs,
surpassing FCOS3D by 5.4% mAP and 4.5% NDS. The superior performance of BEVDet
demonstrates the power of paradigm innovation.
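As a concrete illustration of the paradigm, below is a minimal PyTorch-style sketch of the four-stage pipeline. The sub-module choices, argument names, and tensor shapes are assumptions for exposition only, not the paper's exact implementation; BEVDet itself is assembled from existing components for each stage.

from torch import nn

class BEVDetSketch(nn.Module):
    # Four-stage BEVDet-style pipeline; every sub-module is a placeholder
    # (the paper reuses existing components for each role).
    def __init__(self, image_encoder, view_transformer, bev_encoder, head):
        super().__init__()
        self.image_encoder = image_encoder        # encodes features in image view
        self.view_transformer = view_transformer  # image view -> BEV
        self.bev_encoder = bev_encoder            # further encodes features in BEV
        self.head = head                          # task-specific head predicting in BEV

    def forward(self, imgs, cam_params):
        # imgs: (B, N, 3, H, W) images from N surrounding cameras
        B, N = imgs.shape[:2]
        feats = self.image_encoder(imgs.flatten(0, 1))   # per-image features
        bev = self.view_transformer(feats.unflatten(0, (B, N)), cam_params)
        bev = self.bev_encoder(bev)                      # BEV-space encoding
        return self.head(bev)                            # 3D detections in BEV

The exclusive data augmentation strategy mentioned in the abstract is orthogonal to this structure (it acts on the image inputs and on the BEV representation during training) and is omitted here.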
Related papers
- WidthFormer: Toward Efficient Transformer-based BEV View Transformation [21.10523575080856]
WidthFormer is a transformer-based module to compute Bird's-Eye-View (BEV) representations from multi-view cameras for real-time autonomous-driving applications.
We first introduce a novel 3D positional encoding mechanism capable of accurately encapsulating 3D geometric information.
We then develop two modules to compensate for potential information loss due to feature compression.
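(A minimal sketch of this kind of image-to-BEV view transformation appears after this list.)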
arXiv Detail & Related papers (2024-01-08T11:50:23Z)
- QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection [57.019527599167255]
Multi-view 3D detection based on BEV (bird-eye-view) has recently achieved significant improvements.
We show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation.
Our method, QD-BEV, introduces a novel view-guided distillation (VGD) objective, which stabilizes quantization-aware training (QAT) while enhancing model performance.
arXiv Detail & Related papers (2023-08-21T07:06:49Z)
- MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation [104.12419434114365]
In real-world applications, sensor corruptions and failures lead to inferior performance.
We propose a robust framework, called MetaBEV, to address extreme real-world environments.
We show MetaBEV outperforms prior methods by a large margin on both full and corrupted modalities.
arXiv Detail & Related papers (2023-04-19T16:37:17Z)
- BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation [14.606324706328106]
We propose a dual-branch framework to generate LiDAR and camera BEV features, then perform adaptive modality fusion.
A LiDAR-Guided View Transformer (LGVT) is designed to effectively obtain the camera representation in BEV space.
Our framework, dubbed BEVFusion4D, achieves state-of-the-art results in 3D object detection.
arXiv Detail & Related papers (2023-03-30T02:18:07Z)
- Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline [76.48192454417138]
Bird's-Eye View (BEV) representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception.
This paper proposes a framework, termed Fast-BEV, which is capable of performing faster BEV perception on on-vehicle chips.
arXiv Detail & Related papers (2023-01-29T18:43:31Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower computation cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)
- BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection [14.11339105810819]
BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space.
We simplify the velocity learning task by removing the factors of ego-motion and time, which equips BEVDet4D with robust generalization performance.
On the challenging nuScenes benchmark, we report a new record of 51.5% NDS with the high-performance configuration dubbed BEVDet4D-Base.
arXiv Detail & Related papers (2022-03-31T14:21:19Z)
- Robust 2D/3D Vehicle Parsing in CVIS [54.825777404511605]
We present a novel approach to robustly detect and perceive vehicles in different camera views as part of a cooperative vehicle-infrastructure system (CVIS).
Our formulation is designed for arbitrary camera views and makes no assumptions about intrinsic or extrinsic parameters.
In practice, our approach outperforms SOTA methods on 2D detection, instance segmentation, and 6-DoF pose estimation.
arXiv Detail & Related papers (2021-03-11T03:35:05Z)
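Several of the papers above (e.g. WidthFormer, Fast-BEV) revolve around the image-to-BEV view transformation at the heart of this paradigm. As a rough illustration, the sketch below implements a Lift-Splat-style version of that step, the approach BEVDet's view transformer builds on: each pixel's features are weighted by a predicted distribution over discrete depths ("lift"), and the lifted points are sum-pooled into BEV grid cells ("splat"). The class name, shapes, and the calibration-derived cell_index input are assumptions for exposition.

from torch import nn

class LiftSplatSketch(nn.Module):
    # Illustrative Lift-Splat-style view transformer; not the exact code of
    # BEVDet or any paper listed above.
    def __init__(self, in_ch, bev_ch, num_depths):
        super().__init__()
        self.num_depths = num_depths
        self.bev_ch = bev_ch
        # A 1x1 conv predicts, per pixel, a depth distribution and BEV features.
        self.depth_net = nn.Conv2d(in_ch, num_depths + bev_ch, kernel_size=1)

    def forward(self, feats, cell_index, num_cells):
        # feats:      (N, C, Hf, Wf) features from N camera images
        # cell_index: (N, D, Hf, Wf) long tensor; flat BEV-cell id of each
        #             (camera, depth, pixel) sample, precomputed from calibration
        x = self.depth_net(feats)
        depth = x[:, :self.num_depths].softmax(dim=1)  # per-pixel depth distribution
        ctx = x[:, self.num_depths:]                   # per-pixel BEV features
        # "Lift": outer product spreads each pixel's features over all depths.
        pts = depth.unsqueeze(2) * ctx.unsqueeze(1)    # (N, D, C', Hf, Wf)
        pts = pts.permute(0, 1, 3, 4, 2).reshape(-1, self.bev_ch)
        # "Splat": sum-pool every lifted point into its BEV grid cell.
        bev = pts.new_zeros(num_cells, self.bev_ch)
        bev.index_add_(0, cell_index.reshape(-1), pts)
        return bev  # reshape to (C', H_bev, W_bev) before the BEV encoder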