KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird's-Eye-View Segmentation
- URL: http://arxiv.org/abs/2512.15311v1
- Date: Wed, 17 Dec 2025 11:00:00 GMT
- Title: KD360-VoxelBEV: LiDAR and 360-degree Camera Cross Modality Knowledge Distillation for Bird's-Eye-View Segmentation
- Authors: Wenke E, Yixin Sun, Jiaxu Liu, Hubert P. H. Shum, Amir Atapour-Abarghouei, Toby P. Breckon
- Abstract summary: We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation.
- Score: 30.730703237135216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first cross-modality distillation framework specifically tailored for single-panoramic-camera Bird's-Eye-View (BEV) segmentation. Our approach leverages a novel LiDAR image representation fused from range, intensity and ambient channels, together with a voxel-aligned view transformer that preserves spatial fidelity while enabling efficient BEV processing. During training, a high-capacity LiDAR and camera fusion Teacher network extracts both rich spatial and semantic features for cross-modality knowledge distillation into a lightweight Student network that relies solely on a single 360-degree panoramic camera image. Extensive experiments on the Dur360BEV dataset demonstrate that our teacher model significantly outperforms existing camera-based BEV segmentation methods, achieving a 25.6% IoU improvement. Meanwhile, the distilled Student network attains competitive performance with an 8.5% IoU gain and state-of-the-art inference speed of 31.2 FPS. Moreover, evaluations on KITTI-360 (two fisheye cameras) confirm that our distillation framework generalises to diverse camera setups, underscoring its feasibility and robustness. This approach reduces sensor complexity and deployment costs while providing a practical solution for efficient, low-cost BEV segmentation in real-world autonomous driving.
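The teacher-student setup the abstract describes can be summarised in a short training-step sketch. This is a minimal PyTorch-style illustration under assumed interfaces (a frozen `teacher` returning fused BEV features, a `student` returning BEV features and segmentation logits); the module signatures and the loss weighting `alpha` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of cross-modality knowledge distillation for BEV
# segmentation (illustrative only; module interfaces and loss weights
# are assumptions, not the paper's actual implementation).
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, lidar_img, pano_img, bev_labels,
                      optimizer, alpha=0.5):
    """One training step: the frozen LiDAR+camera fusion teacher provides
    BEV features that the camera-only student learns to imitate."""
    with torch.no_grad():                          # teacher is frozen
        t_feat, _ = teacher(lidar_img, pano_img)   # fused teacher BEV features

    s_feat, s_logits = student(pano_img)           # student sees only the panorama

    # Feature-imitation (distillation) loss in the shared BEV space.
    kd_loss = F.mse_loss(s_feat, t_feat)
    # Standard supervised segmentation loss on the BEV labels
    # (float labels of the same shape as the logits).
    seg_loss = F.binary_cross_entropy_with_logits(s_logits, bev_labels)

    loss = seg_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time only the student runs, which is what enables the reported 31.2 FPS from a single panoramic camera.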
Related papers
- BridgeTA: Bridging the Representation Gap in Knowledge Distillation via Teacher Assistant for Bird's Eye View Map Segmentation [17.072492774587456]
Camera-only approaches have drawn attention as cost-effective alternatives to LiDAR, but they still fall behind LiDAR-Camera (LC) fusion-based methods. We introduce BridgeTA, a cost-effective distillation framework to bridge the representation gap between LC fusion and Camera-only models. Our method achieves an improvement of 4.2% mIoU over the Camera-only baseline, up to 45% higher than the improvement of other state-of-the-art KD methods.
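BridgeTA's core idea, a trainable intermediate that narrows the teacher-student representation gap, can be sketched generically. The `TAHead` module and the single-term loss below are illustrative assumptions in the spirit of teacher-assistant distillation, not BridgeTA's actual architecture.

```python
# Generic teacher-assistant (TA) distillation sketch: a lightweight,
# trainable bridge lifts camera-only student features into the
# LC-fusion teacher's feature space before matching them.
import torch.nn as nn
import torch.nn.functional as F

class TAHead(nn.Module):
    """Hypothetical bridge from student to teacher feature space."""
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(c_student, c_teacher, 3, padding=1),
            nn.BatchNorm2d(c_teacher), nn.ReLU(inplace=True),
            nn.Conv2d(c_teacher, c_teacher, 1))

    def forward(self, s_feat):
        return self.proj(s_feat)

def ta_distill_loss(s_feat, t_feat, ta_head):
    # Student features pass through the trainable TA head and are
    # matched to the frozen teacher; gradients flow through the TA
    # into the student, softening the representation gap.
    return F.mse_loss(ta_head(s_feat), t_feat.detach())
```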
arXiv Detail & Related papers (2025-08-13T08:28:21Z)
- Dur360BEV: A Real-world 360-degree Single Camera Dataset and Benchmark for Bird-Eye View Mapping in Autonomous Driving [16.771347109638775]
Dur360BEV is an autonomous driving dataset equipped with a high-resolution 128-channel 3D LiDAR and an RTK-refined GNSS/INS system. This dataset and benchmark address the challenge of generating Bird-Eye-View (BEV) maps using only a single spherical camera.
arXiv Detail & Related papers (2025-03-02T00:40:50Z)
- OneBEV: Using One Panoramic Image for Bird's-Eye-View Semantic Mapping [25.801868221496473]
OneBEV is a novel BEV semantic mapping approach using merely a single panoramic image as input.
A distortion-aware module termed Mamba View Transformation (MVT) is specifically designed to handle the spatial distortions in panoramas.
This work advances BEV semantic mapping in autonomous driving, paving the way for more advanced and reliable autonomous systems.
arXiv Detail & Related papers (2024-09-20T21:33:53Z)
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which learns effectively from unlabelled target-domain data, remains far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection [66.74183705987276]
We introduce a framework to improve the camera-only apprentice model, including an apprentice-friendly multi-modal expert and temporal-fusion-friendly distillation supervision.
With those improvements, our camera-only apprentice VCD-A sets new state-of-the-art on nuScenes with a score of 63.1% NDS.
arXiv Detail & Related papers (2023-10-24T09:29:26Z)
- DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending, as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR-based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector.
arXiv Detail & Related papers (2023-09-26T17:56:21Z)
- BEVFusion4D: Learning LiDAR-Camera Fusion Under Bird's-Eye-View via Cross-Modality Guidance and Temporal Aggregation [14.606324706328106]
We propose a dual-branch framework to generate LiDAR and camera BEV representations and then perform adaptive modality fusion.
A LiDAR-Guided View Transformer (LGVT) is designed to effectively obtain the camera representation in BEV space.
Our framework dubbed BEVFusion4D achieves state-of-the-art results in 3D object detection.
arXiv Detail & Related papers (2023-03-30T02:18:07Z)
- BiFuse++: Self-supervised and Efficient Bi-projection Fusion for 360 Depth Estimation [59.11106101006008]
We propose BiFuse++ to explore the combination of bi-projection fusion and the self-training scenario.
We propose a new fusion module and Contrast-Aware Photometric Loss to improve the performance of BiFuse.
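For reference, the Contrast-Aware Photometric Loss is a variant of the standard SSIM + L1 photometric loss used in self-supervised depth estimation. The sketch below shows the standard Monodepth-style formulation, not BiFuse++'s contrast-aware variant; `ssim_map` is an assumed helper returning per-pixel SSIM similarity.

```python
# Standard SSIM + L1 photometric loss (for reference only; BiFuse++
# modifies this formulation with a contrast-aware term).
import torch

def photometric_loss(pred, target, ssim_map, alpha=0.85):
    """pred, target: (B, C, H, W) reconstructed and reference images.
    ssim_map: assumed helper returning per-pixel SSIM in [0, 1]."""
    l1 = (pred - target).abs().mean(dim=1, keepdim=True)   # per-pixel L1
    dssim = (1.0 - ssim_map(pred, target)) / 2.0           # structural term
    return (alpha * dssim + (1.0 - alpha) * l1).mean()
```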
arXiv Detail & Related papers (2022-09-07T06:24:21Z)
- Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that a range-view (RV) based 3D detector with standard 2D convolutions alone can achieve performance comparable to state-of-the-art BEV-based detectors.
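Both FCOS-LiDAR's range view and KD360-VoxelBEV's range/intensity/ambient LiDAR image start from a spherical projection of the point cloud onto a 2D grid. A minimal NumPy sketch of that projection follows; the image size and vertical field of view are sensor-specific, so the values below are illustrative assumptions.

```python
# Minimal spherical (range-view) projection of a LiDAR point cloud
# into a 2D range image. H, W, and the vertical field of view are
# illustrative; real sensors define their own values.
import numpy as np

def range_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) array of x, y, z in the sensor frame."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)               # range per point

    yaw = np.arctan2(y, x)                           # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.clip(r, 1e-6, None))    # elevation

    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                # column index
    v = (fov_up_r - pitch) / (fov_up_r - fov_down_r) * H  # row index

    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    img = np.zeros((H, W), dtype=np.float32)
    order = np.argsort(-r)       # draw far points first so near points win
    img[v[order], u[order]] = r[order]
    return img
```

Intensity or ambient channels can be rasterised onto the same (v, u) grid to build a multi-channel LiDAR image.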
arXiv Detail & Related papers (2022-05-27T05:42:16Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
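The key step is that once camera and LiDAR features live on the same BEV grid, fusing them is cheap. Below is a minimal sketch of such a fusion block; the channel sizes are illustrative and this generic convolutional fuser is not BEVFusion's exact module.

```python
# Minimal BEV-space fusion block: concatenate camera and LiDAR features
# on a shared BEV grid and mix them with convolutions. Channel sizes
# are illustrative assumptions, not BEVFusion's configuration.
import torch
import torch.nn as nn

class BEVFuser(nn.Module):
    def __init__(self, c_cam=80, c_lidar=256, c_out=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_cam + c_lidar, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True))

    def forward(self, cam_bev, lidar_bev):
        # Both inputs: (B, C, H_bev, W_bev) on the same BEV grid.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```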
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)