BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection
- URL: http://arxiv.org/abs/2512.02972v1
- Date: Tue, 02 Dec 2025 17:50:33 GMT
- Title: BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection
- Authors: Guowen Zhang, Chenhang He, Liyi Chen, Lei Zhang
- Abstract summary: We propose BEVDilation, a novel framework that prioritizes LiDAR information in the fusion. Our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods.
- Score: 17.604622218531155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, indiscriminate fusion in previous methods often leads to degraded performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information in the fusion. By formulating image BEV features as implicit guidance rather than naive concatenation, our strategy effectively alleviates the spatial misalignment caused by image depth estimation errors. Furthermore, the image guidance can effectively help the LiDAR-centric paradigm to address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors. Moreover, we introduce a Semantic-Guided BEV Dilation Block to enhance the LiDAR feature diffusion processing with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise compared to naive fusion. The source code is available at https://github.com/gwenzhang/BEVDilation.
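The idea of densifying foreground voxels through image priors can be illustrated with a toy example. The sketch below is not the authors' Sparse Voxel Dilation Block; it is a minimal 2D illustration under stated assumptions: a dense boolean BEV occupancy grid stands in for the sparse voxels, `fg_prob` is a hypothetical per-cell foreground probability derived from image features, and the function name, the 4-neighbourhood, and the `threshold` value are illustrative choices.

```python
import numpy as np


def dilate_foreground(occupancy, fg_prob, threshold=0.5):
    """Grow the occupied set by one cell (4-neighbourhood), but only into
    cells whose image-derived foreground probability exceeds `threshold`.

    occupancy: (H, W) array, truthy where a LiDAR voxel exists (assumed layout).
    fg_prob:   (H, W) float array, hypothetical image foreground prior.
    """
    occ = occupancy.astype(bool)
    # Pad with empty cells so border cells can be handled uniformly.
    padded = np.pad(occ, 1, mode="constant", constant_values=False)
    # A cell is adjacent to the occupied set if any 4-neighbour is occupied.
    neighbour = (
        padded[:-2, 1:-1]   # cell above
        | padded[2:, 1:-1]  # cell below
        | padded[1:-1, :-2] # cell to the left
        | padded[1:-1, 2:]  # cell to the right
    )
    candidates = neighbour & ~occ
    # Existing voxels are always kept; new ones need image support.
    return occ | (candidates & (fg_prob > threshold))


if __name__ == "__main__":
    occupancy = np.zeros((5, 5), dtype=bool)
    occupancy[2, 2] = True
    fg_prob = np.full((5, 5), 0.1)
    fg_prob[2, 3] = 0.9  # adjacent to the occupied cell, likely foreground
    fg_prob[0, 0] = 0.9  # likely foreground, but not adjacent to any voxel
    print(dilate_foreground(occupancy, fg_prob).astype(int))
```

Because new cells are only added next to existing LiDAR voxels, the image prior decides whether to grow but never where geometry starts, which loosely mirrors the LiDAR-centric framing of the abstract.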
Related papers
- DensifyBeforehand: LiDAR-assisted Content-aware Densification for Efficient and Quality 3D Gaussian Splatting [1.5576275034099496]
This paper addresses the limitations of existing 3D Gaussian Splatting (3DGS) methods by combining sparse LiDAR data with monocular depth estimation from corresponding RGB images. Our ROI-aware sampling scheme prioritizes semantically and geometrically important regions, yielding a dense point cloud. Our method achieves comparable results to state-of-the-art techniques while significantly lowering resource consumption and training time.
arXiv Detail & Related papers (2025-11-24T16:39:13Z)
- SDGOCC: Semantic and Depth-Guided Bird's-Eye View Transformation for 3D Multimodal Occupancy Prediction [8.723840755505817]
We propose a novel multimodal occupancy prediction network called SDG-OCC. It incorporates a joint semantic and depth-guided view transformation and a fusion-to-occupancy-driven active distillation. Our method achieves state-of-the-art (SOTA) performance with real-time processing on the Occ3D-nuScenes dataset.
arXiv Detail & Related papers (2025-07-22T23:49:40Z)
- Physically Based Neural LiDAR Resimulation [4.349248791803596]
We show that our method achieves more accurate LiDAR simulation compared to existing techniques. Our approach exhibits advanced resimulation capabilities, such as generating high resolution LiDAR scans in the camera perspective.
arXiv Detail & Related papers (2025-07-15T19:49:44Z)
- Depth3DLane: Monocular 3D Lane Detection via Depth Prior Distillation [9.125062959539699]
We introduce a BEV-based framework to address limitations and improve 3D lane detection accuracy. We leverage Depth Prior Distillation to transfer semantic depth knowledge from a teacher model. Our method achieves state-of-the-art performance in terms of z-axis error.
arXiv Detail & Related papers (2025-04-25T13:08:41Z)
- FLARES: Fast and Accurate LiDAR Multi-Range Semantic Segmentation [52.89847760590189]
3D scene understanding is a critical yet challenging task in autonomous driving. Recent methods leverage the range-view representation to improve processing efficiency. We re-design the workflow for range-view-based LiDAR semantic segmentation.
arXiv Detail & Related papers (2025-02-13T12:39:26Z)
- RaCFormer: Towards High-Quality 3D Object Detection via Query-based Radar-Camera Fusion [58.77329237533034]
We propose a Radar-Camera fusion transformer (RaCFormer) to boost the accuracy of 3D object detection. RaCFormer achieves superior results of 64.9% mAP and 70.2% NDS on the nuScenes dataset.
arXiv Detail & Related papers (2024-12-17T09:47:48Z)
- Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector [97.92369017531038]
We build a new laRge-scale Adversarial images dataset with Diverse hArmful Responses (RADAR).
We then develop a novel iN-time Embedding-based AdveRSarial Image DEtection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of Visual Language Models (VLMs) to distinguish adversarial images from benign ones in the input.
arXiv Detail & Related papers (2024-10-30T10:33:10Z)
- Fine-grained Image-to-LiDAR Contrastive Distillation with Visual Foundation Models [55.99654128127689]
Visual Foundation Models (VFMs) are used to generate semantic labels for weakly-supervised pixel-to-point contrastive distillation. We adapt sampling probabilities of points to address imbalances in spatial distribution and category frequency. Our approach consistently surpasses existing image-to-LiDAR contrastive distillation methods in downstream tasks.
arXiv Detail & Related papers (2024-05-23T07:48:19Z)
- LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation [78.74202673902303]
We propose a coarse-to-fine LiDAR and camera fusion-based network (termed LIF-Seg) for LiDAR segmentation.
The proposed method fully utilizes the contextual information of images and introduces a simple but effective early-fusion strategy.
The cooperation of these two components leads to the success of the effective camera-LiDAR fusion.
arXiv Detail & Related papers (2021-08-17T08:53:11Z)
- Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection [86.25022248968908]
We learn context- and depth-aware feature representation to solve the problem of monocular 3D object detection.
We show state-of-the-art results among the monocular-based approaches on the KITTI benchmark dataset.
arXiv Detail & Related papers (2021-03-30T16:20:24Z)
- SelfVoxeLO: Self-supervised LiDAR Odometry with Voxel-based Deep Neural Networks [81.64530401885476]
We propose a self-supervised LiDAR odometry method, dubbed SelfVoxeLO, to tackle these two difficulties.
Specifically, we propose a 3D convolution network to process the raw LiDAR data directly, which extracts features that better encode the 3D geometric patterns.
We evaluate our method's performance on two large-scale datasets, i.e., KITTI and Apollo-SouthBay.
arXiv Detail & Related papers (2020-09-03T18:39:57Z)
- Depth Completion via Inductive Fusion of Planar LIDAR and Monocular Camera [27.978780155504467]
We introduce an inductive late-fusion block which better fuses different sensor modalities inspired by a probability model.
This block uses the dense context features to guide the depth prediction based on demonstrations by sparse depth features.
Our method shows promising results compared to previous approaches on both the benchmark datasets and a simulated dataset.
arXiv Detail & Related papers (2020-09-03T18:39:57Z)