UniDistill: A Universal Cross-Modality Knowledge Distillation Framework
for 3D Object Detection in Bird's-Eye View
- URL: http://arxiv.org/abs/2303.15083v1
- Date: Mon, 27 Mar 2023 10:50:58 GMT
- Title: UniDistill: A Universal Cross-Modality Knowledge Distillation Framework
for 3D Object Detection in Bird's-Eye View
- Authors: Shengchao Zhou, Weizhou Liu, Chen Hu, Shuchang Zhou, and Chao Ma
- Abstract summary: We propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors.
UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%~3.2%.
- Score: 7.1054067852590865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of 3D object detection for autonomous driving, the sensor
portfolio, spanning both multi-modality and single-modality setups, is diverse and complex.
Since multi-modal methods add system complexity while the accuracy of single-modal
ones is relatively low, striking a tradeoff between the two is
difficult. In this work, we propose a universal cross-modality knowledge
distillation framework (UniDistill) to improve the performance of
single-modality detectors. Specifically, during training, UniDistill projects
the features of both the teacher and the student detector into Bird's-Eye-View
(BEV), which is a friendly representation for different modalities. Then, three
distillation losses are calculated to sparsely align the foreground features,
helping the student learn from the teacher without introducing additional cost
during inference. Taking advantage of the similar detection paradigm of
different detectors in BEV, UniDistill easily supports LiDAR-to-camera,
camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Furthermore, the three distillation losses can filter the effect of misaligned
background information and balance between objects of different sizes,
improving the distillation effectiveness. Extensive experiments on nuScenes
demonstrate that UniDistill effectively improves the mAP and NDS of student
detectors by 2.0%~3.2%.
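The abstract's core mechanism, projecting both detectors' features into BEV and aligning only foreground locations, can be illustrated with a short sketch. Everything below is an illustrative assumption rather than the paper's implementation: the `foreground_mask` helper, the center-radius definition of foreground, and the plain MSE alignment are stand-ins for the three losses the paper actually uses.

```python
import torch
import torch.nn.functional as F

def foreground_mask(gt_centers_bev, bev_shape, radius=2):
    """Illustrative helper (assumption): mark BEV cells around ground-truth object
    centers as foreground. gt_centers_bev holds (x, y) cell indices; bev_shape is (H, W)."""
    H, W = bev_shape
    mask = torch.zeros(H, W, dtype=torch.bool)
    for x, y in gt_centers_bev.tolist():
        x0, x1 = max(x - radius, 0), min(x + radius + 1, W)
        y0, y1 = max(y - radius, 0), min(y + radius + 1, H)
        mask[y0:y1, x0:x1] = True
    return mask

def sparse_bev_feature_distillation(student_bev, teacher_bev, gt_centers_bev):
    """One possible distillation term: align student and teacher BEV features only at
    foreground cells, so misaligned background does not contribute to the loss.
    student_bev and teacher_bev are (C, H, W) feature maps from the two detectors."""
    mask = foreground_mask(gt_centers_bev, student_bev.shape[-2:])
    if not mask.any():                      # no annotated objects in this sample
        return student_bev.new_zeros(())
    s = student_bev[:, mask]                # (C, K) student features at foreground cells
    t = teacher_bev[:, mask].detach()       # teacher supplies targets, no gradient
    return F.mse_loss(s, t)
```

Because such losses are applied only during training, the student detector's inference-time architecture and cost stay unchanged, which is how the framework can support the four distillation paths listed above.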
Related papers
- Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving [58.16024314532443]
We introduce LaserMix++, a framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to assist data-efficient learning.
Results demonstrate that LaserMix++ outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations.
This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
arXiv Detail & Related papers (2024-05-08T17:59:53Z) - OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation [67.56268991234371]
OV-Uni3DETR achieves the state-of-the-art performance on various scenarios, surpassing existing methods by more than 6% on average.
Code and pre-trained models will be released later.
arXiv Detail & Related papers (2024-03-28T17:05:04Z) - Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection [66.74183705987276]
We introduce a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision.
With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
arXiv Detail & Related papers (2023-10-24T09:29:26Z) - DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal
Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending, as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR-based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV-based student detector by training it to imitate the features of a well-trained LiDAR-based teacher detector.
arXiv Detail & Related papers (2023-09-26T17:56:21Z) - SimDistill: Simulated Multi-modal Distillation for BEV 3D Object
Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z) - X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for
Multi-Camera 3D Object Detection [45.32989526953387]
This paper introduces X$^3$KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD.
After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features.
Our final X$3$KD model outperforms previous state-of-the-art approaches on the nuScenes and datasets.
arXiv Detail & Related papers (2023-03-03T20:29:49Z) - Structured Knowledge Distillation Towards Efficient and Compact
Multi-View 3D Detection [30.74309289544479]
We propose a structured knowledge distillation framework to improve the efficiency of vision-only BEV detection models.
Experimental results show that our method leads to an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark.
arXiv Detail & Related papers (2022-11-14T12:51:17Z) - Boosting 3D Object Detection by Simulating Multimodality on Point Clouds [51.87740119160152]
This paper presents a new approach to boost a single-modality (LiDAR) 3D object detector by teaching it to simulate features and responses that follow a multi-modality (LiDAR-image) detector.
The approach needs LiDAR-image data only when training the single-modality detector, and once well-trained, it only needs LiDAR data at inference.
Experimental results on the nuScenes dataset show that our approach outperforms all SOTA LiDAR-only 3D detectors.
arXiv Detail & Related papers (2022-06-30T01:44:30Z) - The Devil is in the Task: Exploiting Reciprocal Appearance-Localization
Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.