Structured Knowledge Distillation Towards Efficient and Compact
Multi-View 3D Detection
- URL: http://arxiv.org/abs/2211.08398v1
- Date: Mon, 14 Nov 2022 12:51:17 GMT
- Title: Structured Knowledge Distillation Towards Efficient and Compact
Multi-View 3D Detection
- Authors: Linfeng Zhang, Yukang Shi, Hung-Shuo Tai, Zhipeng Zhang, Yuan He, Ke
Wang, Kaisheng Ma
- Abstract summary: We propose a structured knowledge distillation framework to improve the efficiency of vision-only BEV detection models.
Experimental results show that our method leads to an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark.
- Score: 30.74309289544479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting 3D objects from multi-view images is a fundamental problem in 3D
computer vision. Recently, significant breakthroughs have been made in multi-view
3D detection tasks. However, the unprecedented detection performance of these
vision BEV (bird's-eye-view) detection models comes with enormous parameter
counts and computation costs, which make them unaffordable on edge devices. To
address this problem, in this paper, we propose a structured knowledge
distillation framework, aiming to improve the efficiency of modern vision-only
BEV detection models. The proposed framework mainly includes: (a)
spatial-temporal distillation, which distills teacher knowledge of information
fusion from different timestamps and views, (b) BEV response distillation, which
distills the teacher's response to different pillars, and (c) weight-inheriting, which
solves the problem of inconsistent inputs between student and teacher in
modern transformer architectures. Experimental results show that our method
leads to an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes
benchmark, outperforming multiple baselines by a large margin.
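The abstract names its components but gives no formulas, so the following is only a toy sketch of two of the ideas, not the authors' implementation. It assumes that BEV response distillation can be approximated by a mean-squared error between student and teacher response maps (one value per pillar), and that weight-inheriting can be approximated by copying teacher parameters into the student wherever names and sizes match. The function names `bev_response_loss` and `inherit_weights`, the parameter names, and the toy values are all hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]  # toy H x W BEV response map, one value per pillar


def bev_response_loss(student: Grid, teacher: Grid) -> float:
    """Mean squared error between two equally shaped BEV response maps.

    Assumption: the paper's actual loss is not given in the abstract;
    plain MSE is used here purely for illustration.
    """
    same_rows = len(student) == len(teacher)
    if not same_rows or any(len(s) != len(t) for s, t in zip(student, teacher)):
        raise ValueError("student and teacher maps must have the same shape")
    diffs = [
        (s - t) ** 2
        for s_row, t_row in zip(student, teacher)
        for s, t in zip(s_row, t_row)
    ]
    return sum(diffs) / len(diffs)


def inherit_weights(
    student: Dict[str, List[float]], teacher: Dict[str, List[float]]
) -> List[str]:
    """Copy teacher parameters into the student where name and size match.

    Assumption: a simplified stand-in for weight-inheriting; parameters whose
    sizes differ (e.g. a smaller student backbone) are left untouched.
    Returns the names of the inherited parameters.
    """
    inherited = []
    for name, value in teacher.items():
        if name in student and len(student[name]) == len(value):
            student[name] = list(value)  # copy, so the teacher is not aliased
            inherited.append(name)
    return inherited


# Toy data (hypothetical values, not from the paper).
teacher_map = [[0.0, 1.0], [0.5, 0.0]]
student_map = [[0.0, 0.5], [0.5, 0.5]]
loss = bev_response_loss(student_map, teacher_map)
print(loss)  # 0.125

teacher_params = {"bev_queries": [0.1, 0.2, 0.3], "backbone.w": [1.0, 2.0, 3.0, 4.0]}
student_params = {"bev_queries": [0.0, 0.0, 0.0], "backbone.w": [0.0, 0.0]}
inherited_names = inherit_weights(student_params, teacher_params)
print(inherited_names)  # ['bev_queries']
```

In this sketch the student keeps its own smaller backbone weights and inherits only the parameters it shares in shape with the teacher. A real detector would apply the same idea to framework tensors rather than Python lists.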
Related papers
- Multi-View Attentive Contextualization for Multi-View 3D Object Detection [19.874148893464607]
We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based 3D (MV3D) object detection.
In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark, using both the BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as the PETR.
arXiv Detail & Related papers (2024-05-20T17:37:10Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering
Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Instance-aware Multi-Camera 3D Object Detection with Structural Priors
Mining and Self-Boosting Learning [93.71280187657831]
Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z) - DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal
Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector.
arXiv Detail & Related papers (2023-09-26T17:56:21Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - SimDistill: Simulated Multi-modal Distillation for BEV 3D Object
Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z) - UniDistill: A Universal Cross-Modality Knowledge Distillation Framework
for 3D Object Detection in Bird's-Eye View [7.1054067852590865]
We propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors.
UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths.
Experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%~3.2%.
arXiv Detail & Related papers (2023-03-27T10:50:58Z) - BEV-LGKD: A Unified LiDAR-Guided Knowledge Distillation Framework for
BEV 3D Object Detection [40.45938603642747]
We propose a unified framework named BEV-LGKD to transfer the knowledge in the teacher-student manner.
Our method only uses LiDAR points to guide the KD between RGB models.
arXiv Detail & Related papers (2022-12-01T16:17:39Z) - BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object
Detection [17.526914782562528]
3D object detection from multiple image views is a challenging task for visual scene understanding.
We propose BEVDistill, a cross-modal BEV knowledge distillation framework for multi-view 3D object detection.
Our best model achieves 59.4 NDS on the nuScenes test leaderboard, setting a new state of the art among image-based detectors.
arXiv Detail & Related papers (2022-11-17T07:26:14Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.