Related papers: Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather

URL: http://arxiv.org/abs/2512.13107v2
Date: Thu, 18 Dec 2025 16:00:41 GMT
Title: Diffusion-Based Restoration for Multi-Modal 3D Object Detection in Adverse Weather
Authors: Zhijian He, Feifei Liu, Yuwei Li, Zhanpeng Luo, Jintao Cheng, Xieyuanli Chen, Xiaoyu Tang,
Abstract summary: DiffFusion is a novel framework designed to enhance robustness in challenging weather.<n>Our key insight is that diffusion models possess strong capabilities for denoising and generating data.<n> implementation of our DiffFusion will be released as open-source.
Score: 15.57759675028067
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal 3D object detection is important for reliable perception in robotics and autonomous driving. However, its effectiveness remains limited under adverse weather conditions due to weather-induced distortions and misalignment between different data modalities. In this work, we propose DiffFusion, a novel framework designed to enhance robustness in challenging weather through diffusion-based restoration and adaptive cross-modal fusion. Our key insight is that diffusion models possess strong capabilities for denoising and generating data that can adapt to various weather conditions. Building on this, DiffFusion introduces Diffusion-IR restoring images degraded by weather effects and Point Cloud Restoration (PCR) compensating for corrupted LiDAR data using image object cues. To tackle misalignments between two modalities, we develop Bidirectional Adaptive Fusion and Alignment Module (BAFAM). It enables dynamic multi-modal fusion and bidirectional bird's-eye view (BEV) alignment to maintain consistent spatial correspondence. Extensive experiments on three public datasets show that DiffFusion achieves state-of-the-art robustness under adverse weather while preserving strong clean-data performance. Zero-shot results on the real-world DENSE dataset further validate its generalization. The implementation of our DiffFusion will be released as open-source.

Related papers

2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection [9.873449426376787]
We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR)<n>MAFR synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders.<n>Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts.
arXiv Detail & Related papers (2025-10-20T03:57:50Z)
FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.<n>We propose the Self-supervised Transfer (PST) and the FrequencyDe-coupled Fusion module (FreDF)<n>PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity.<n>FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.<n>This combined approach enables FUSE to construct a universal image-event that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures.<n>Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts.<n>We propose a dual-stage Foundation-to-Diffusion framework that precisely align 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
FOF-X: Towards Real-time Detailed Human Reconstruction from a Single Image [64.96903230497755]
We introduce FOF-X for real-time reconstruction of detailed human geometry from a single image.<n>The core of FOF is to factorize a 3D occupancy field into a 2D vector field, retaining topology and spatial relationships within the 3D domain.<n>Based on FOF, we design a new reconstruction framework, FOF-X, to avoid the performance degradation caused by texture and lighting.
arXiv Detail & Related papers (2024-12-08T14:46:29Z)
Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities either in Bird's Eye View (BEV) or Perspective View (PV) We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
CFMW: Cross-modality Fusion Mamba for Robust Object Detection under Adverse Weather [15.472015859766069]
We propose the Cross-modality Fusion Mamba with Weather-removal (CFMW) to augment stability and cost-effectiveness under adverse weather conditions.<n>CFMW is able to reconstruct visual features affected by adverse weather, enriching the representation of image details.<n>To bridge the gap in relevant datasets, we construct a new Severe Weather Visible-Infrared (SWVI) dataset.
arXiv Detail & Related papers (2024-04-25T02:54:11Z)
ContextualFusion: Context-Based Multi-Sensor Fusion for 3D Object Detection in Adverse Operating Conditions [1.7537812081430004]
We propose a technique called ContextualFusion to incorporate the domain knowledge about cameras and lidars behaving differently across lighting and weather variations into 3D object detection models. Our approach yields an mAP improvement of 6.2% over state-of-the-art methods on our context-balanced synthetic dataset. Our method enhances state-of-the-art 3D objection performance at night on the real-world NuScenes dataset with a significant mAP improvement of 11.7%.
arXiv Detail & Related papers (2024-04-23T06:37:54Z)
Improving Robustness of LiDAR-Camera Fusion Model against Weather Corruption from Fusion Strategy Perspective [26.391161934274876]
LiDAR-camera fusion models have advanced 3D object detection tasks in autonomous driving. robustness against common weather corruption such as fog, rain, snow, and sunlight in the intricate physical world remains underexplored. We propose a concise yet practical fusion strategy to enhance the robustness of the fusion models.
arXiv Detail & Related papers (2024-02-05T05:38:50Z)
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction [74.49121940466675]
We introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction. First, to avoid the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
arXiv Detail & Related papers (2023-11-23T20:14:50Z)
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR. fusing these two modalities can significantly boost the performance of 3D perception models. We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion(CB-Fusion) module. The proposed CB-Fusion module boosts the plentiful semantic information of point features with the image features in a cascade bi-directional interaction fusion manner. The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.