DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
- URL: http://arxiv.org/abs/2509.09828v1
- Date: Thu, 11 Sep 2025 20:03:00 GMT
- Title: DGFusion: Depth-Guided Sensor Fusion for Robust Semantic Perception
- Authors: Tim Broedermann, Christos Sakaridis, Luigi Piccinelli, Wim Abbeloos, Luc Van Gool
- Abstract summary: State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input. We propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets.
- Score: 57.77346327566903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust semantic perception for autonomous vehicles relies on effectively combining multiple sensors with complementary strengths and weaknesses. State-of-the-art sensor fusion approaches to semantic perception often treat sensor data uniformly across the spatial extent of the input, which hinders performance when faced with challenging conditions. By contrast, we propose a novel depth-guided multimodal fusion method that upgrades condition-aware fusion by integrating depth information. Our network, DGFusion, poses multimodal segmentation as a multi-task problem, utilizing the lidar measurements, which are typically available in outdoor sensor suites, both as one of the model's inputs and as ground truth for learning depth. Our corresponding auxiliary depth head helps to learn depth-aware features, which are encoded into spatially varying local depth tokens that condition our attentive cross-modal fusion. Together with a global condition token, these local depth tokens dynamically adapt sensor fusion to the spatially varying reliability of each sensor across the scene, which largely depends on depth. In addition, we propose a robust loss for our depth, which is essential for learning from lidar inputs that are typically sparse and noisy in adverse conditions. Our method achieves state-of-the-art panoptic and semantic segmentation performance on the challenging MUSES and DELIVER datasets. Code and models will be available at https://github.com/timbroed/DGFusion
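The abstract describes an auxiliary depth head whose depth-aware features are encoded into spatially varying local depth tokens that condition attentive cross-modal fusion, together with a robust depth loss for sparse, noisy lidar supervision. Since the abstract gives no implementation details, the following PyTorch sketch is only a rough illustration of that idea under stated assumptions: the module layout, the coarse token grid, and the choice of a masked Huber-style loss are guesses, not the authors' DGFusion implementation.

```python
# Illustrative sketch only: depth-token-conditioned cross-modal fusion and a
# masked robust depth loss. Names, shapes, and the loss choice are assumptions,
# not the DGFusion reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthTokenFusion(nn.Module):
    """Fuses camera and lidar features with attention conditioned on local depth tokens."""

    def __init__(self, dim=256, num_heads=8, grid=8):
        super().__init__()
        self.grid = grid                                      # depth tokens on a coarse grid x grid layout
        self.depth_head = nn.Conv2d(dim, 1, kernel_size=1)    # auxiliary depth prediction head
        self.token_proj = nn.Conv2d(dim, dim, kernel_size=1)  # depth-aware features -> token features
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cam_feat, lidar_feat):
        # cam_feat, lidar_feat: (B, C, H, W) features from modality-specific encoders.
        B, C, H, W = cam_feat.shape
        depth_feat = cam_feat + lidar_feat            # shared depth-aware features (assumed combination)
        depth_pred = self.depth_head(depth_feat)      # (B, 1, H, W) auxiliary depth map

        # Pool depth-aware features into a coarse grid of local depth tokens.
        tokens = F.adaptive_avg_pool2d(self.token_proj(depth_feat), self.grid)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, grid*grid, C)

        # Queries come from the camera branch; the lidar branch plus the depth
        # tokens form keys/values, so attention can adapt spatially with depth.
        q = cam_feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        kv = lidar_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = torch.cat([kv, tokens], dim=1)           # append depth tokens as extra context
        fused, _ = self.cross_attn(q, kv, kv)
        fused = self.norm(fused + q)
        return fused.transpose(1, 2).reshape(B, C, H, W), depth_pred


def masked_robust_depth_loss(pred, lidar_depth, delta=1.0):
    """Huber-style loss applied only where lidar returns exist (assumed form of the robust loss)."""
    valid = lidar_depth > 0                           # assumption: zeros mark missing lidar points
    if valid.sum() == 0:
        return pred.sum() * 0.0
    return F.huber_loss(pred[valid], lidar_depth[valid], delta=delta)
```

In this sketch, a call like `fused, depth = DepthTokenFusion()(cam_feat, lidar_feat)` would yield fused features for a segmentation head plus an auxiliary depth map, with `masked_robust_depth_loss(depth, lidar_depth)` added to the training objective; the actual DGFusion model also uses a global condition token, which is omitted here.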
Related papers
- UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block [6.994911870644179]
We propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.
arXiv Detail & Related papers (2025-07-26T13:29:48Z) - DepthSeg: Depth prompting in remote sensing semantic segmentation [16.93010831616395]
In this paper, we introduce a depth prompting two-dimensional (2D) remote sensing semantic segmentation framework (DepthSeg). It automatically models depth/height information from 2D remote sensing images and integrates it into the semantic segmentation framework. Experiments on the LiuZhou dataset validate the advantages of the DepthSeg framework in land cover mapping tasks.
arXiv Detail & Related papers (2025-06-17T10:27:59Z) - CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes [56.52618054240197]
We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token. Our model significantly improves robustness and accuracy, especially in adverse-condition scenarios.
arXiv Detail & Related papers (2024-10-14T17:56:20Z) - Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios [103.72094710263656]
This paper presents a novel approach that identifies and integrates dominant cross-modality depth features with a learning-based framework.
We propose a novel confidence loss that steers a confidence predictor network to yield a confidence map specifying latent potential depth areas.
With the resulting confidence map, we propose a multi-modal fusion network that fuses the final depth in an end-to-end manner.
arXiv Detail & Related papers (2024-02-19T04:39:16Z) - Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction [38.971222477695214]
Our RGB-LIDAR-based multi-task feature fusion network, coined Cognitive TransFuser, outperforms the baseline network by a significant margin, enabling safer and more complete road navigation.
We validate the proposed network on the Town05 Short and Town05 Long benchmarks through extensive experiments, achieving real-time inference at up to 44.2 FPS.
arXiv Detail & Related papers (2023-08-04T03:59:10Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Learning Online Multi-Sensor Depth Fusion [100.84519175539378]
SenFuNet is a depth fusion approach that learns sensor-specific noise and outlier statistics.
We conduct experiments with various sensor combinations on the real-world CoRBS and Scene3D datasets.
arXiv Detail & Related papers (2022-04-07T10:45:32Z) - GEM: Glare or Gloom, I Can Still See You -- End-to-End Multimodal Object Detector [11.161639542268015]
We propose sensor-aware multi-modal fusion strategies for 2D object detection in harsh-lighting conditions.
Our network learns to estimate the measurement reliability of each sensor modality in the form of scalar weights and masks (a generic sketch of this weighting idea appears after this list).
We show that the proposed strategies outperform the existing state-of-the-art methods on the FLIR-Thermal dataset.
arXiv Detail & Related papers (2021-02-24T14:56:37Z) - Learning Selective Sensor Fusion for States Estimation [47.76590539558037]
We propose SelectFusion, an end-to-end selective sensor fusion module.
During prediction, the network is able to assess the reliability of the latent features from different sensor modalities.
We extensively evaluate all fusion strategies on both public datasets and progressively degraded datasets.
arXiv Detail & Related papers (2019-12-30T20:25:16Z)
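Several of the entries above (e.g., GEM and SelectFusion) describe estimating a per-modality reliability score that weights sensor features before fusion. The snippet below is a minimal, generic sketch of that weighting idea, not code from any of the cited papers; all names and the simple softmax-over-modalities design are assumptions.

```python
# Generic sketch of per-modality reliability weighting, as summarized for GEM and
# SelectFusion above; module and variable names are hypothetical.
import torch
import torch.nn as nn


class ReliabilityWeightedFusion(nn.Module):
    """Predicts a scalar reliability weight per modality and fuses features accordingly."""

    def __init__(self, dim=256, num_modalities=2):
        super().__init__()
        # One small head per modality maps globally pooled features to a reliability logit.
        self.scorers = nn.ModuleList([nn.Linear(dim, 1) for _ in range(num_modalities)])

    def forward(self, feats):
        # feats: list of (B, C, H, W) tensors, one per sensor modality.
        logits = []
        for f, scorer in zip(feats, self.scorers):
            pooled = f.mean(dim=(2, 3))          # (B, C) global context per modality
            logits.append(scorer(pooled))        # (B, 1) reliability logit
        weights = torch.softmax(torch.cat(logits, dim=1), dim=1)  # (B, M), sums to 1 over modalities
        fused = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), feats))
        return fused, weights
```

A per-pixel variant would predict a weight map rather than a single scalar per modality, which is closer to the mask-based weighting mentioned for GEM and to the spatially varying depth tokens used by DGFusion.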
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.