SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
- URL: http://arxiv.org/abs/2507.17083v1
- Date: Tue, 22 Jul 2025 23:49:40 GMT
- Title: SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction
- Authors: Zaipeng Duan, Chenxu Dang, Xuzhong Hu, Pei An, Junfeng Ding, Jie Zhan, Yunbiao Xu, Jie Ma
- Abstract summary: We propose a novel multimodal occupancy prediction network called SDG-OCC. It incorporates a joint semantic and depth-guided view transformation and a fusion-to-occupancy-driven active distillation. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset.
- Score: 8.723840755505817
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal 3D occupancy prediction has garnered significant attention for its potential in autonomous driving. However, most existing approaches are single-modality: camera-based methods lack depth information, while LiDAR-based methods struggle with occlusions. Current lightweight methods primarily rely on the Lift-Splat-Shoot (LSS) pipeline, which suffers from inaccurate depth estimation and fails to fully exploit the geometric and semantic information of 3D LiDAR points. Therefore, we propose a novel multimodal occupancy prediction network called SDG-OCC, which incorporates a joint semantic and depth-guided view transformation coupled with a fusion-to-occupancy-driven active distillation. The enhanced view transformation constructs accurate depth distributions by integrating pixel semantics and co-point depth through diffusion and bilinear discretization. The fusion-to-occupancy-driven active distillation extracts rich semantic information from multimodal data and selectively transfers knowledge to image features based on LiDAR-identified regions. Finally, for optimal performance, we introduce SDG-Fusion, which uses fusion alone, and SDG-KL, which integrates both fusion and distillation for faster inference. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset and shows comparable performance on the more challenging SurroundOcc-nuScenes dataset, demonstrating its effectiveness and robustness. The code will be released at https://github.com/DzpLab/SDGOCC.
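The joint semantic and depth-guided view transformation described in the abstract refines an LSS-style per-pixel depth distribution with projected LiDAR "co-point" depths before lifting image features toward BEV. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function names, depth-bin parameters, and blending weight are assumptions, and the semantic-diffusion branch is omitted for brevity.

```python
# Minimal sketch (illustrative, not the SDG-OCC code) of depth-guided lifting:
# sparse projected LiDAR depths refine a predicted depth distribution via
# bilinear discretization onto depth bins, then image features are lifted
# along depth bins in the LSS style.
import torch

def lidar_depth_prior(sparse_depth, d_min=2.0, d_max=58.0, n_bins=80):
    """Turn sparse projected LiDAR depths (B, H, W; 0 where no point) into a
    per-pixel prior over depth bins: each valid depth splits its mass between
    the two nearest bins (bilinear discretization)."""
    B, H, W = sparse_depth.shape
    bin_size = (d_max - d_min) / (n_bins - 1)
    idx = (sparse_depth - d_min) / bin_size              # continuous bin index
    lo = idx.floor().clamp(0, n_bins - 1).long()
    hi = (lo + 1).clamp(max=n_bins - 1)
    w_hi = (idx - lo.float()).clamp(0, 1)                # weight on the upper bin
    prior = sparse_depth.new_zeros(B, n_bins, H, W)
    prior.scatter_add_(1, lo.unsqueeze(1), (1 - w_hi).unsqueeze(1))
    prior.scatter_add_(1, hi.unsqueeze(1), w_hi.unsqueeze(1))
    valid = (sparse_depth > 0).unsqueeze(1).float()
    return prior * valid, valid                          # (B, D, H, W), (B, 1, H, W)

def guided_depth_distribution(depth_logits, sparse_depth, alpha=0.7):
    """Blend the camera branch's predicted depth distribution with the LiDAR
    prior where LiDAR points exist; elsewhere keep the prediction."""
    pred = depth_logits.softmax(dim=1)                   # (B, D, H, W)
    prior, valid = lidar_depth_prior(sparse_depth, n_bins=pred.shape[1])
    return (1 - alpha * valid) * pred + alpha * valid * prior

def lift_features(img_feats, depth_dist):
    """LSS-style outer product: place image features along depth bins to form
    a frustum of 3D features, (B, C, D, H, W), ready to splat into BEV."""
    return img_feats.unsqueeze(2) * depth_dist.unsqueeze(1)

if __name__ == "__main__":
    B, C, D, H, W = 1, 64, 80, 16, 44
    feats = torch.randn(B, C, H, W)
    logits = torch.randn(B, D, H, W)
    lidar = torch.zeros(B, H, W)
    lidar[:, ::4, ::4] = 20.0                            # fake sparse projected depths
    frustum = lift_features(feats, guided_depth_distribution(logits, lidar))
    print(frustum.shape)                                 # torch.Size([1, 64, 80, 16, 44])
```

In the full pipeline the depth distribution would additionally be shaped by pixel semantics (the diffusion step) and the lifted frustum splatted into the BEV grid; only the depth-guided lifting step is sketched here.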
Related papers
- TACOcc:Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy [14.075911467687789]
We propose a target-scale adaptive, symmetric retrieval mechanism for 3D semantic occupancy prediction. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise. In summary, we propose TACOcc, an adaptive multi-modal fusion framework for 3D semantic occupancy prediction, enhanced by volume rendering supervision.
arXiv Detail & Related papers (2025-05-19T04:32:36Z) - econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians [56.85804719947]
We propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG shows state-of-the-art performance on four benchmark datasets compared to the existing methods.
arXiv Detail & Related papers (2025-04-08T13:12:31Z) - GSPR: Multimodal Place Recognition Using 3D Gaussian Splatting for Autonomous Driving [9.023864430027333]
We propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the Multimodal Gaussian Splatting. Our method can effectively leverage the complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability.
arXiv Detail & Related papers (2024-10-01T00:43:45Z) - FSMDet: Vision-guided feature diffusion for fully sparse 3D detector [0.8437187555622164]
We propose FSMDet (Fully Sparse Multi-modal Detection), which uses visual information to guide the LiDAR feature diffusion process.
Our method can be up to 5 times more efficient than previous SOTA methods in the inference process.
arXiv Detail & Related papers (2024-09-11T01:55:45Z) - GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a Geometric-Enhanced Occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z) - Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction [10.698054425507475]
This letter presents a novel multi-modal, i.e., LiDAR-camera 3D semantic occupancy prediction framework, dubbed Co-Occ.
Volume rendering in the feature space proficiently bridges the gap between 3D LiDAR sweeps and 2D images.
arXiv Detail & Related papers (2024-04-06T09:01:19Z) - MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation [54.27399121779011]
We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images.
We show that our approach can yield more accurate synthesis compared to recent state-of-the-art, including distillation-based 3D inference and prior multi-view generation methods.
arXiv Detail & Related papers (2024-04-04T17:59:57Z) - OccFusion: Depth Estimation Free Multi-sensor Fusion for 3D Occupancy Prediction [5.285847977231642]
3D occupancy prediction based on multi-sensor fusion is crucial for a reliable autonomous driving system.
Previous fusion-based 3D occupancy predictions relied on depth estimation for processing 2D image features.
We propose OccFusion, a depth-estimation-free multi-modal fusion framework.
arXiv Detail & Related papers (2024-03-08T14:07:37Z) - RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z) - Dense Voxel Fusion for 3D Object Detection [10.717415797194896]
Dense Voxel Fusion (DVF) is a sequential fusion method that generates multi-scale dense voxel feature representations.
We train directly with ground truth 2D bounding box labels, avoiding noisy, detector-specific, 2D predictions.
We show that our proposed multi-modal training strategy results in better generalization compared to training using erroneous 2D predictions.
arXiv Detail & Related papers (2022-03-02T04:51:31Z) - Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z) - Volumetric Propagation Network: Stereo-LiDAR Fusion for Long-Range Depth Estimation [81.08111209632501]
We propose a geometry-aware stereo-LiDAR fusion network for long-range depth estimation.
We exploit sparse and accurate point clouds as a cue for guiding correspondences of stereo images in a unified 3D volume space.
Our network achieves state-of-the-art performance on the KITTI and the Virtual KITTI datasets.
arXiv Detail & Related papers (2021-03-24T03:24:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.