Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
- URL: http://arxiv.org/abs/2310.15670v1
- Date: Tue, 24 Oct 2023 09:29:26 GMT
- Title: Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
- Authors: Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu
Qiao, Hongyang Li
- Abstract summary: We introduce a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision.
With these improvements, our camera-only apprentice VCD-A sets a new state of the art on nuScenes with 63.1% NDS.
- Score: 66.74183705987276
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current research is primarily dedicated to advancing the accuracy of
camera-only 3D object detectors (apprentice) through the knowledge transferred
from LiDAR- or multi-modal-based counterparts (expert). However, the presence
of the domain gap between LiDAR and camera features, coupled with the inherent
incompatibility in temporal fusion, significantly hinders the effectiveness of
distillation-based enhancements for apprentices. Motivated by the success of
uni-modal distillation, we argue that an apprentice-friendly expert model should
rely predominantly on camera features while still achieving performance
comparable to multi-modal models. To this end, we introduce VCD, a framework to improve the
camera-only apprentice model, including an apprentice-friendly multi-modal
expert and temporal-fusion-friendly distillation supervision. The multi-modal
expert VCD-E adopts the same structure as the camera-only apprentice in order
to alleviate the feature disparity, and leverages LiDAR input as a depth prior
for reconstructing the 3D scene, achieving performance on par with other
heterogeneous multi-modal experts. Additionally, a
fine-grained trajectory-based distillation module is introduced with the
purpose of individually rectifying the motion misalignment for each object in
the scene. With these improvements, our camera-only apprentice VCD-A sets a new state of the art on nuScenes with 63.1% NDS.
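To make the "LiDAR input as a depth prior" idea in the abstract concrete, here is a minimal sketch of one plausible realization: LiDAR points are projected through the camera intrinsics and rasterized into a per-pixel depth-bin distribution that a lift-splat style view transform could consume. The function and parameter names (lidar_depth_prior, num_bins) and the pinhole projection are assumptions for illustration, not the paper's released code.

```python
# Illustrative sketch: turn LiDAR points into a dense depth prior for a
# camera-style expert. Assumed shapes and names; not VCD's actual code.
import torch

def lidar_depth_prior(points_cam, K, img_hw, d_min=1.0, d_max=60.0, num_bins=60):
    """points_cam: (N, 3) LiDAR points already in the camera frame.
    K: (3, 3) camera intrinsics. Returns a (num_bins, H, W) depth-bin prior."""
    H, W = img_hw
    pts = points_cam[points_cam[:, 2] > d_min]   # keep points in front of the camera
    uvw = (K @ pts.T).T                          # pinhole projection, (N, 3)
    uv = (uvw[:, :2] / uvw[:, 2:3]).long()       # integer pixel coordinates
    depth = pts[:, 2].clamp(d_min, d_max - 1e-3)
    # Discard projections that fall outside the image.
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    uv, depth = uv[valid], depth[valid]
    # Discretize depth into bins and scatter hits into a (D, H, W) volume
    # (ties between points landing on the same pixel are ignored in this sketch).
    bin_idx = ((depth - d_min) / (d_max - d_min) * num_bins).long()
    prior = torch.zeros(num_bins, H, W)
    prior[bin_idx, uv[:, 1], uv[:, 0]] = 1.0
    return prior
```

A dense prior of this kind lets the expert sidestep monocular depth ambiguity while keeping a camera-style architecture, which is what makes it structurally friendly to the apprentice.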
Related papers
- MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection [42.4932760909941]
Monocular 3D object detection is an indispensable research topic in autonomous driving.
The challenges of Mono3D lie in understanding 3D scene geometry and reconstructing 3D object information from a single image.
Previous methods attempted to transfer 3D information directly from the LiDAR-based teacher to the camera-based student.
arXiv Detail & Related papers (2024-04-07T10:39:04Z)
- DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending, as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR-based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector.
arXiv Detail & Related papers (2023-09-26T17:56:21Z)
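The feature-imitation objective DistillBEV describes can be pictured with a short sketch: the camera student's BEV map is trained to match a frozen LiDAR teacher's, weighted toward foreground cells. This is an illustrative reading, not the paper's released code; the mask source and weighting scheme are assumptions.

```python
# Illustrative foreground-weighted BEV feature imitation (assumed design).
import torch
import torch.nn.functional as F

def bev_imitation_loss(student_bev, teacher_bev, fg_mask, fg_weight=5.0):
    """student_bev, teacher_bev: (B, C, H, W); fg_mask: (B, 1, H, W) in {0, 1},
    e.g. ground-truth boxes rasterized onto the BEV grid (assumed here)."""
    if student_bev.shape[1] != teacher_bev.shape[1]:
        raise ValueError("align channels with a 1x1 conv adapter before this loss")
    per_cell = F.mse_loss(student_bev, teacher_bev.detach(), reduction="none")
    weight = (1.0 + (fg_weight - 1.0) * fg_mask).expand_as(per_cell)
    return (per_cell * weight).sum() / weight.sum().clamp(min=1.0)
```

Weighting matters because most BEV cells are empty background; imitating them uniformly would drown out the object signal.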
- CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV Perception [32.91233926771015]
CALICO is a novel framework that applies contrastive objectives to both LiDAR and camera backbones.
Our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception.
arXiv Detail & Related papers (2023-06-01T05:06:56Z)
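The contrastive objective on the two backbones can be illustrated with a standard symmetric InfoNCE between camera-BEV and LiDAR-BEV embeddings sampled at the same cells. This is a minimal sketch under assumed names (bev_infonce; the cell-sampling step is presumed done upstream), not CALICO's actual implementation.

```python
# Illustrative symmetric InfoNCE between paired camera/LiDAR BEV embeddings.
import torch
import torch.nn.functional as F

def bev_infonce(cam_feats, lidar_feats, temperature=0.07):
    """cam_feats, lidar_feats: (N, C) features taken at the same N BEV cells;
    row i of each tensor is a positive pair, all other rows are negatives."""
    cam = F.normalize(cam_feats, dim=1)
    lid = F.normalize(lidar_feats, dim=1)
    logits = cam @ lid.T / temperature                    # (N, N) similarities
    targets = torch.arange(len(cam), device=cam.device)   # diagonal = positives
    # Average the camera->LiDAR and LiDAR->camera directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

The pull toward cross-modal agreement at matching cells is the pre-training signal; no labels are required.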
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View [7.1054067852590865]
We propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors.
UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%-3.2%.
arXiv Detail & Related papers (2023-03-27T10:50:58Z)
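The "universal" aspect, one routine reused across every teacher-to-student modality pair, can be sketched as a simple dispatch. The names and the single MSE term here are assumptions for illustration; the full method layers several BEV-domain distillation losses on top of this skeleton.

```python
# Illustrative dispatch for modality-agnostic BEV distillation paths.
import torch
import torch.nn.functional as F

PATHS = {"lidar->camera", "camera->lidar", "fusion->lidar", "fusion->camera"}

def unidistill_step(path, teacher_bev, student_bev):
    """teacher_bev, student_bev: (B, C, H, W) BEV features. Because both
    modalities are first rendered into the same BEV space, one loss serves
    every path; only the (teacher, student) pairing changes."""
    if path not in PATHS:
        raise ValueError(f"unsupported path: {path}")
    return F.mse_loss(student_bev, teacher_bev.detach())
```

The point of the shared BEV space is that no path needs modality-specific alignment code.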
- X$^3$KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection [45.32989526953387]
This paper introduces X$^3$KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD.
After transforming features to the bird's-eye view, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features.
Our final X$^3$KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets.
arXiv Detail & Related papers (2023-03-03T20:29:49Z)
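The adversarial-training component (X-AT) suggests a GAN-style alignment of camera BEV features toward LiDAR ones. Below is a minimal sketch of that general pattern only; the discriminator architecture and the 256-channel width are inventions for illustration, not the paper's design.

```python
# Illustrative adversarial BEV feature alignment (assumed architecture).
import torch
import torch.nn as nn

disc = nn.Sequential(            # tiny per-cell ("patch") discriminator
    nn.Conv2d(256, 64, 1), nn.ReLU(),
    nn.Conv2d(64, 1, 1),
)
bce = nn.BCEWithLogitsLoss()

def discriminator_loss(cam_bev, lidar_bev):
    """Teach disc to label LiDAR-derived BEV cells 1 and camera-derived 0."""
    real, fake = disc(lidar_bev.detach()), disc(cam_bev.detach())
    return bce(real, torch.ones_like(real)) + bce(fake, torch.zeros_like(fake))

def generator_loss(cam_bev):
    """Train the camera branch so its BEV features look LiDAR-like to disc."""
    fake = disc(cam_bev)
    return bce(fake, torch.ones_like(fake))
```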
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
- Monocular Depth Estimation with Self-supervised Instance Adaptation [138.0231868286184]
In robotics applications, multiple views of a scene may or may not be available, depending on the actions of the robot.
We propose a new approach that extends any off-the-shelf self-supervised monocular depth reconstruction system to use more than one image at test time.
arXiv Detail & Related papers (2020-04-13T08:32:03Z)