DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal
Knowledge Distillation
- URL: http://arxiv.org/abs/2309.15109v1
- Date: Tue, 26 Sep 2023 17:56:21 GMT
- Title: DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal
Knowledge Distillation
- Authors: Zeyu Wang, Dingwen Li, Chenxu Luo, Cihang Xie, Xiaodong Yang
- Abstract summary: 3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D perception based on the representations learned from multi-camera
bird's-eye-view (BEV) is trending as cameras are cost-effective for mass
production in the autonomous driving industry. However, there exists a distinct
performance gap between multi-camera BEV and LiDAR based 3D object detection.
One key reason is that LiDAR captures accurate depth and other geometry
measurements, while it is notoriously challenging to infer such 3D information
from merely image input. In this work, we propose to boost the representation
learning of a multi-camera BEV based student detector by training it to imitate
the features of a well-trained LiDAR based teacher detector. We propose an
effective balancing strategy that forces the student to focus on learning the
crucial features from the teacher, and generalize knowledge transfer to
multi-scale layers with temporal fusion. We conduct extensive evaluations on
multiple representative models of multi-camera BEV. Experiments reveal that our
approach renders significant improvement over the student models, leading to
the state-of-the-art performance on the popular benchmark nuScenes.
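The abstract's core idea, training a camera-based student's BEV features to imitate a LiDAR-based teacher's while re-balancing the loss toward crucial (foreground) regions, can be sketched with a minimal NumPy example. The function name, tensor shapes, and the foreground weight below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def distill_loss(student_feat, teacher_feat, fg_mask, fg_weight=5.0):
    """Feature-imitation distillation loss on a BEV grid.

    student_feat, teacher_feat: (H, W, C) BEV feature maps.
    fg_mask: (H, W) boolean mask of foreground (object) cells.
    fg_weight: illustrative up-weighting of foreground cells, so the
    student focuses on objects rather than the dominant background.
    """
    # Per-cell weights: foreground cells count more than background.
    w = np.where(fg_mask, fg_weight, 1.0)
    # Normalize so the loss scale is independent of the fg/bg split.
    w = w / w.sum()
    # Squared error per cell, averaged over feature channels.
    sq_err = ((student_feat - teacher_feat) ** 2).mean(axis=-1)
    # Weighted sum over all BEV cells.
    return float((w * sq_err).sum())

# Toy BEV grid: 4x4 cells, 8 feature channels.
rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 4, 8))
student = teacher + 0.1 * rng.standard_normal((4, 4, 8))
fg = np.zeros((4, 4), dtype=bool)
fg[1:3, 1:3] = True  # pretend an object occupies the center cells
print(distill_loss(student, teacher, fg))
```

In practice such a loss would be applied at multiple feature scales (and to temporally fused features), as the abstract describes, with the weighting derived from ground-truth boxes or teacher responses.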
Related papers
- Instance-aware Multi-Camera 3D Object Detection with Structural Priors
Mining and Self-Boosting Learning [93.71280187657831]
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
- Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection [66.74183705987276]
We introduce a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision.
With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
arXiv Detail & Related papers (2023-10-24T09:29:26Z)
- CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception [32.91233926771015]
CALICO is a novel framework that applies contrastive objectives to both LiDAR and camera backbones.
Our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception.
arXiv Detail & Related papers (2023-06-01T05:06:56Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for BEV 3D Object Detection [40.45938603642747]
We propose a unified framework named BEV-LGKD to transfer knowledge in a teacher-student manner.
Our method only uses LiDAR points to guide the KD between RGB models.
arXiv Detail & Related papers (2022-12-01T16:17:39Z)
- BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection [17.526914782562528]
3D object detection from multiple image views is a challenging task for visual scene understanding.
We propose BEVDistill, a cross-modal BEV knowledge distillation framework for multi-view 3D object detection.
Our best model achieves 59.4 NDS on the nuScenes test leaderboard, achieving new state-of-the-art in comparison with various image-based detectors.
arXiv Detail & Related papers (2022-11-17T07:26:14Z)
- Structured Knowledge Distillation Towards Efficient and Compact Multi-View 3D Detection [30.74309289544479]
We propose a structured knowledge distillation framework to improve the efficiency of vision-only BEV detection models.
Experimental results show that our method leads to an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark.
arXiv Detail & Related papers (2022-11-14T12:51:17Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.