UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation
- URL: http://arxiv.org/abs/2602.19349v1
- Date: Sun, 22 Feb 2026 21:34:29 GMT
- Title: UP-Fuse: Uncertainty-guided LiDAR-Camera Fusion for 3D Panoptic Segmentation
- Authors: Rohit Mohan, Florian Drews, Yakov Miron, Daniele Cattaneo, Abhinav Valada
- Abstract summary: We introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder. Camera features are simultaneously extracted and projected into the same shared space.
- Score: 17.310791153991975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-camera fusion enhances 3D panoptic segmentation by leveraging camera images to complement sparse LiDAR scans, but it also introduces a critical failure mode. Under adverse conditions, degradation or failure of the camera sensor can significantly compromise the reliability of the perception system. To address this problem, we introduce UP-Fuse, a novel uncertainty-aware fusion framework in the 2D range-view that remains robust under camera sensor degradation, calibration drift, and sensor failure. Raw LiDAR data is first projected into the range-view and encoded by a LiDAR encoder, while camera features are simultaneously extracted and projected into the same shared space. At its core, UP-Fuse employs an uncertainty-guided fusion module that dynamically modulates cross-modal interaction using predicted uncertainty maps. These maps are learned by quantifying representational divergence under diverse visual degradations, ensuring that only reliable visual cues influence the fused representation. The fused range-view features are decoded by a novel hybrid 2D-3D transformer that mitigates spatial ambiguities inherent to the 2D projection and directly predicts 3D panoptic segmentation masks. Extensive experiments on Panoptic nuScenes, SemanticKITTI, and our introduced Panoptic Waymo benchmark demonstrate the efficacy and robustness of UP-Fuse, which maintains strong performance even under severe visual corruption or misalignment, making it well suited for robotic perception in safety-critical settings.
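The abstract's core mechanism lends itself to a compact illustration. Below is a minimal PyTorch sketch of uncertainty-gated fusion in a shared 2D range-view grid; the module name, channel sizes, sigmoid gating form, and range-view resolution are illustrative assumptions, not the actual UP-Fuse design.

```python
# Hedged sketch: per-pixel uncertainty gates how much the camera features
# influence the fused range-view representation. Not the UP-Fuse code.
import torch
import torch.nn as nn


class UncertaintyGuidedFusion(nn.Module):
    """Gates camera features with a predicted per-pixel uncertainty map."""

    def __init__(self, channels: int = 64):
        super().__init__()
        # Predict one uncertainty value per range-view pixel from both modalities.
        self.uncertainty_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # 0 = reliable camera evidence, 1 = unreliable
        )
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, lidar_feat: torch.Tensor, cam_feat: torch.Tensor):
        # lidar_feat, cam_feat: (B, C, H, W) features in the shared range view.
        uncertainty = self.uncertainty_head(torch.cat([lidar_feat, cam_feat], dim=1))
        gated_cam = (1.0 - uncertainty) * cam_feat   # down-weight unreliable pixels
        fused = self.mix(torch.cat([lidar_feat, gated_cam], dim=1))
        return fused, uncertainty


if __name__ == "__main__":
    fuse = UncertaintyGuidedFusion(channels=64)
    lidar = torch.randn(2, 64, 32, 512)   # range-view resolution is illustrative
    camera = torch.randn(2, 64, 32, 512)  # camera features already projected to the range view
    fused, u = fuse(lidar, camera)
    print(fused.shape, u.shape)           # (2, 64, 32, 512) and (2, 1, 32, 512)
```

Per the abstract, the uncertainty maps are learned by quantifying representational divergence under diverse visual degradations; the convolutional head above is only a stand-in for whatever supervision produces them.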
Related papers
- A Comparative Study of 3D Person Detection: Sensor Modalities and Robustness in Diverse Indoor and Outdoor Environments [5.89179309980335]
This work presents a systematic evaluation of 3D person detection using camera-only, LiDAR-only, and camera-LiDAR fusion. We compare three representative models - BEVDepth (camera), PointPillars (LiDAR), and DAL (camera-LiDAR fusion). Our results show that the fusion-based approach consistently outperforms single-modality models, particularly in challenging scenarios.
arXiv Detail & Related papers (2026-02-05T10:53:35Z)
- Semantic Causality-Aware Vision-Based 3D Occupancy Prediction [63.752869043357585]
Vision-based 3D semantic occupancy prediction is a critical task in 3D vision. Existing methods, however, often rely on modular pipelines. We propose a novel causal loss that enables holistic, end-to-end supervision of the modular 2D-to-3D transformation pipeline.
arXiv Detail & Related papers (2025-09-10T08:29:22Z)
- Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z)
- Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection [7.448164560761331]
Existing methods suffer from spatial misalignment between LiDAR and camera features. The root cause of this misalignment lies in projection errors, stemming from calibration inaccuracies and the rolling shutter effect. We introduce Discontinuity Aware Geometric Fusion to suppress residual noise from PGDC and explicitly enhance sharp depth transitions at object-background boundaries. Our method achieves SOTA performance on the nuScenes validation dataset, with mAP and NDS reaching 71.5% and 73.6%, respectively.
arXiv Detail & Related papers (2025-07-21T18:12:22Z)
- SemanticBEVFusion: Rethink LiDAR-Camera Fusion in Unified Bird's-Eye View Representation for 3D Object Detection [14.706717531900708]
LiDAR and camera are two essential sensors for 3D object detection in autonomous driving.
Recent methods focus on point-level fusion which paints the LiDAR point cloud with camera features in the perspective view.
We present SemanticBEVFusion to deeply fuse camera features with LiDAR features in a unified BEV representation.
arXiv Detail & Related papers (2022-12-09T05:48:58Z) - From One to Many: Dynamic Cross Attention Networks for LiDAR and Camera
Fusion [12.792769704561024]
Existing fusion methods tend to align each 3D point to only one projected image pixel based on calibration.
We propose a Dynamic Cross Attention (DCA) module with a novel one-to-many cross-modality mapping.
The whole fusion architecture named Dynamic Cross Attention Network (DCAN) exploits multi-level image features and adapts to multiple representations of point clouds.
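A hedged, generic sketch of the one-to-many idea summarized above (not the DCAN implementation): each point gathers a small window of image pixels around its calibrated projection and attends over them, rather than taking a single pixel. Window size, dimensions, and the attention form are assumptions.

```python
# Generic one-to-many point-to-pixel cross attention sketch (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class OneToManyCrossAttention(nn.Module):
    """Each 3D point attends to a small window of image pixels around its projection."""

    def __init__(self, dim: int = 64, window: int = 3):
        super().__init__()
        self.window = window
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, point_feat, image_feat, pixel_uv):
        # point_feat: (N, C) per-point features
        # image_feat: (C, H, W) image feature map
        # pixel_uv:   (N, 2) integer (u, v) projection of each point into the image
        C, _, _ = image_feat.shape
        r = self.window // 2
        padded = F.pad(image_feat, (r, r, r, r))      # keep border windows in bounds
        patches = []
        for uu, vv in pixel_uv.long().tolist():
            patch = padded[:, vv:vv + self.window, uu:uu + self.window]  # (C, w, w)
            patches.append(patch.reshape(C, -1).T)                       # (w*w, C)
        patches = torch.stack(patches)                                    # (N, w*w, C)

        q = self.q(point_feat).unsqueeze(1)           # (N, 1, C)
        k, v = self.k(patches), self.v(patches)       # (N, w*w, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                  # (N, C) image context per point
```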
arXiv Detail & Related papers (2022-09-25T16:10:14Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [49.689566246504356]
We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions.
TransFusion achieves state-of-the-art performance on large-scale datasets.
We extend the proposed method to the 3D tracking task and achieve 1st place on the nuScenes tracking leaderboard.
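The soft-association idea can be pictured as object queries attending over the whole image feature map with learned weights, instead of a hard calibration-based point-to-pixel lookup. A minimal, generic sketch follows (illustrative dimensions, not the TransFusion architecture).

```python
# Generic soft-association sketch: queries softly aggregate image evidence,
# so degraded regions receive low attention weight rather than corrupting
# a hard one-to-one match. Illustrative only.
import torch
import torch.nn as nn


class SoftAssociation(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, object_queries, image_feat):
        # object_queries: (B, Q, C) LiDAR-initialized object queries
        # image_feat:     (B, C, H, W) camera feature map
        B, C, H, W = image_feat.shape
        tokens = image_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        fused, weights = self.attn(object_queries, tokens, tokens)
        return object_queries + fused, weights


if __name__ == "__main__":
    m = SoftAssociation()
    queries = torch.randn(2, 200, 128)
    image = torch.randn(2, 128, 28, 50)
    out, w = m(queries, image)
    print(out.shape, w.shape)  # (2, 200, 128) and (2, 200, 1400)
```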
arXiv Detail & Related papers (2022-03-22T07:15:13Z)
- EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion(CB-Fusion) module.
The proposed CB-Fusion module enriches the semantic information of point features with image features through cascaded bi-directional interaction.
The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)
- LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation [78.74202673902303]
We propose a coarse-to-fine LiDAR and camera fusion-based network (termed LIF-Seg) for LiDAR segmentation.
The proposed method fully utilizes the contextual information of images and introduces a simple but effective early-fusion strategy.
The cooperation of these two components leads to effective camera-LiDAR fusion.
arXiv Detail & Related papers (2021-08-17T08:53:11Z)
- 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection [10.507404260449333]
We propose a new architecture for fusing camera and LiDAR sensors for 3D object detection.
The proposed 3D-CVF achieves state-of-the-art performance in the KITTI benchmark.
arXiv Detail & Related papers (2020-04-27T08:34:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.