X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
- URL: http://arxiv.org/abs/2303.02203v1
- Date: Fri, 3 Mar 2023 20:29:49 GMT
- Title: X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection
- Authors: Marvin Klingner, Shubhankar Borse, Varun Ravi Kumar, Behnaz Rezaei, Venkatraman Narayanan, Senthil Yogamani, Fatih Porikli
- Abstract summary: This paper introduces X$^3$KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3D object detection (3DOD).
After the view transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features.
Our final X$^3$KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets.
- Score: 45.32989526953387
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent advances in 3D object detection (3DOD) have obtained remarkably strong
results for LiDAR-based models. In contrast, surround-view 3DOD models based on
multiple camera images underperform because the necessary view transformation of
features from perspective view (PV) to a 3D world representation is ambiguous
without depth information. This paper introduces X$^3$KD, a
comprehensive knowledge distillation framework across different modalities,
tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task
distillation from an instance segmentation teacher (X-IS) in the PV feature
extraction stage providing supervision without ambiguous error backpropagation
through the view transformation. After the transformation, we apply cross-modal
feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D
world representation of multi-camera features through the information contained
in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for
cross-modal output distillation (X-OD), providing dense supervision at the
prediction stage. We perform extensive ablations of knowledge distillation at
different stages of multi-camera 3DOD. Our final X$^3$KD model outperforms
previous state-of-the-art approaches on the nuScenes and Waymo datasets and
generalizes to RADAR-based 3DOD. A qualitative results video is available at
https://youtu.be/1do9DPFmr38.
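To make the scheme concrete, here is a minimal PyTorch sketch of the two cross-modal losses named in the abstract: X-FD as imitation of the frozen LiDAR teacher's BEV features, and X-OD as temperature-scaled soft-label supervision at the prediction stage. The plain MSE/KL formulations, tensor shapes, and function names (xfd_loss, xod_loss) are illustrative assumptions rather than the paper's exact implementation; the X-IS and X-AT components are omitted here.

```python
# Minimal sketch (not the paper's code) of the two cross-modal losses named
# in the abstract: X-FD on BEV features and X-OD on dense predictions.
import torch
import torch.nn.functional as F


def xfd_loss(student_bev: torch.Tensor, teacher_bev: torch.Tensor) -> torch.Tensor:
    """X-FD: push multi-camera BEV features toward the LiDAR teacher's
    BEV features. Both tensors assumed shaped (B, C, H, W)."""
    # detach() keeps the teacher frozen; gradients flow only into the student.
    return F.mse_loss(student_bev, teacher_bev.detach())


def xod_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             temperature: float = 2.0) -> torch.Tensor:
    """X-OD: dense soft-label supervision of the student's per-cell class
    scores by the teacher's predictions. Shapes (B, num_classes, H, W)."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits.detach() / t, dim=1)
    log_student = F.log_softmax(student_logits / t, dim=1)
    # KL divergence scaled by T^2, the usual knowledge distillation convention.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t


if __name__ == "__main__":
    b, c, k, h, w = 2, 64, 10, 32, 32  # hypothetical sizes
    feat_loss = xfd_loss(torch.randn(b, c, h, w), torch.randn(b, c, h, w))
    out_loss = xod_loss(torch.randn(b, k, h, w), torch.randn(b, k, h, w))
    print(float(feat_loss + out_loss))
```

In practice the two terms would be weighted and added to the standard detection loss; the weights are tuning choices not specified by the abstract.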
Related papers
- MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection [42.4932760909941]
Monocular 3D object detection is an indispensable research topic in autonomous driving.
The challenges of Mono3D lie in understanding 3D scene geometry and reconstructing 3D object information from a single image.
Previous methods attempted to transfer 3D information directly from the LiDAR-based teacher to the camera-based student.
arXiv Detail & Related papers (2024-04-07T10:39:04Z)
- MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z)
- Volumetric Environment Representation for Vision-Language Navigation [66.04379819772764]
Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions.
We introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells.
VER predicts 3D occupancy, 3D room layout, and 3D bounding boxes jointly.
arXiv Detail & Related papers (2024-03-21T06:14:46Z)
- Weakly Supervised Monocular 3D Detection with a Single-View Image [58.57978772009438]
Monocular 3D detection aims for precise 3D object localization from a single-view image.
We propose SKD-WM3D, a weakly supervised monocular 3D detection framework.
We show that SKD-WM3D clearly surpasses the state-of-the-art and is even on par with many fully supervised methods.
arXiv Detail & Related papers (2024-02-29T13:26:47Z)
- Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection [66.74183705987276]
We introduce a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision.
With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
arXiv Detail & Related papers (2023-10-24T09:29:26Z)
- DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending, as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR-based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV-based student detector by training it to imitate the features of a well-trained LiDAR-based teacher detector (see the sketch after this list).
arXiv Detail & Related papers (2023-09-26T17:56:21Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View [7.1054067852590865]
We propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors.
UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%-3.2%.
arXiv Detail & Related papers (2023-03-27T10:50:58Z)
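As referenced in the DistillBEV entry above, the feature-imitation idea shared by DistillBEV and UniDistill can be sketched modality-agnostically: once teacher and student both produce BEV feature maps, any distillation path (LiDAR-to-camera, fusion-to-camera, and so on) reduces to aligning channels and matching the frozen teacher's features. Everything below, including the names, shapes, and the 1x1-conv adapter, is an illustrative assumption, not either paper's implementation.

```python
# Illustrative sketch only: a modality-agnostic BEV feature-imitation loss,
# the common core of the DistillBEV / UniDistill entries above.
import torch
import torch.nn as nn
import torch.nn.functional as F


def bev_distill(student_bev: torch.Tensor,
                teacher_bev: torch.Tensor,
                adapter: nn.Module) -> torch.Tensor:
    """Map student channels onto the teacher's with a learned adapter, then
    imitate the frozen teacher's BEV features (shapes (B, C, H, W))."""
    return F.mse_loss(adapter(student_bev), teacher_bev.detach())


if __name__ == "__main__":
    # Hypothetical channel widths: student C_s=64, teacher C_t=128.
    adapter = nn.Conv2d(64, 128, kernel_size=1)
    loss = bev_distill(torch.randn(2, 64, 32, 32),
                       torch.randn(2, 128, 32, 32), adapter)
    print(float(loss))
```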
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.