CAFuser: Condition-Aware Multimodal Fusion for Robust Semantic Perception of Driving Scenes
- URL: http://arxiv.org/abs/2410.10791v2
- Date: Mon, 27 Jan 2025 13:45:16 GMT
- Authors: Tim Broedermann, Christos Sakaridis, Yuqian Fu, Luc Van Gool
- Abstract summary: We propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes.
Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token.
Our model significantly improves robustness and accuracy, especially in adverse-condition scenarios.
- Abstract: Leveraging multiple sensors is crucial for robust semantic perception in autonomous driving, as each sensor type has complementary strengths and weaknesses. However, existing sensor fusion methods often treat sensors uniformly across all conditions, leading to suboptimal performance. By contrast, we propose a novel, condition-aware multimodal fusion approach for robust semantic perception of driving scenes. Our method, CAFuser, uses an RGB camera input to classify environmental conditions and generate a Condition Token that guides the fusion of multiple sensor modalities. We further newly introduce modality-specific feature adapters to align diverse sensor inputs into a shared latent space, enabling efficient integration with a single and shared pre-trained backbone. By dynamically adapting sensor fusion based on the actual condition, our model significantly improves robustness and accuracy, especially in adverse-condition scenarios. CAFuser ranks first on the public MUSES benchmarks, achieving 59.7 PQ for multimodal panoptic and 78.2 mIoU for semantic segmentation, and also sets the new state of the art on DeLiVER. The source code is publicly available at: https://github.com/timbroed/CAFuser.
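The condition-aware fusion described in the abstract can be illustrated with a minimal sketch. This is a toy stand-in, not the authors' implementation: the condition classifier, the per-modality scores, and the feature dimensions below are all hypothetical, and CAFuser's actual Condition Token is produced by a learned network rather than simple image statistics.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def condition_token(rgb_stats):
    """Hypothetical stand-in for the RGB condition classifier: maps a
    brightness statistic (0.0 = night .. 1.0 = day) to per-modality scores.
    In darkness, up-weight lidar/radar; in daylight, trust the camera more."""
    brightness = rgb_stats["mean_brightness"]
    return {"camera": brightness, "lidar": 1.0 - brightness, "radar": 0.5}

def fuse(features, token):
    """Weighted sum of per-modality features already aligned into a
    shared latent space (the role of the modality-specific adapters)."""
    mods = list(features)
    weights = softmax([token[m] for m in mods])
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for w, m in zip(weights, mods):
        for i, v in enumerate(features[m]):
            fused[i] += w * v
    return fused, dict(zip(mods, weights))
```

For example, a dark scene (`mean_brightness = 0.1`) yields a fusion weight for lidar that exceeds the camera's, which is the qualitative behavior the abstract describes for adverse conditions.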
Related papers
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Virtual Fusion with Contrastive Learning for Single Sensor-based Activity Recognition [5.225544155289783]
Various types of sensors can be used for Human Activity Recognition (HAR).
Sometimes a single sensor cannot fully observe the user's motions from its perspective, which causes wrong predictions.
We propose Virtual Fusion - a new method that takes advantage of unlabeled data from multiple time-synchronized sensors during training, but only needs one sensor for inference.
arXiv Detail & Related papers (2023-12-01T17:03:27Z) - Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, the fusion for detection can be effectively performed by combining their ROI features.
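FBMNet learns the 3D-to-2D assignments; as a simple non-learned stand-in, the matching step can be sketched as greedy one-to-one assignment by IoU, assuming the 3D proposals have already been projected into image-plane boxes (the threshold and box format below are illustrative choices, not the paper's).

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_proposals(boxes_2d, boxes_3d_projected, thresh=0.5):
    """Greedy one-to-one matching of proposal indices by descending IoU."""
    pairs = sorted(
        ((iou(a, b), i, j)
         for i, a in enumerate(boxes_2d)
         for j, b in enumerate(boxes_3d_projected)),
        reverse=True)
    used_i, used_j, matches = set(), set(), []
    for score, i, j in pairs:
        if score < thresh:
            break
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return matches
```

Each matched index pair would then have its ROI features combined for the final detection head.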
arXiv Detail & Related papers (2023-05-12T18:08:51Z) - RMMDet: Road-Side Multitype and Multigroup Sensor Detection System for Autonomous Driving [3.8917150802484994]
RMMDet is a road-side multitype and multigroup sensor detection system for autonomous driving.
We use a ROS-based virtual environment to simulate real-world conditions.
We produce local datasets and a real sand-table field, and conduct various experiments.
arXiv Detail & Related papers (2023-03-09T12:13:39Z) - Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer [28.15612357340141]
We propose a safety-enhanced autonomous driving framework, named Interpretable Sensor Fusion Transformer (InterFuser).
We process and fuse information from multi-modal multi-view sensors for achieving comprehensive scene understanding and adversarial event detection.
Our framework provides richer semantics, which are exploited to better constrain actions to be within the safe sets.
arXiv Detail & Related papers (2022-07-28T11:36:21Z) - AFT-VO: Asynchronous Fusion Transformers for Multi-View Visual Odometry Estimation [39.351088248776435]
We propose AFT-VO, a novel transformer-based sensor fusion architecture to estimate VO from multiple sensors.
Our framework combines predictions from asynchronous multi-view cameras and accounts for the time discrepancies of measurements coming from different sources.
Our experiments demonstrate that multi-view fusion for VO estimation provides robust and accurate trajectories, outperforming the state of the art in both challenging weather and lighting conditions.
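Accounting for time discrepancies between asynchronous sensors can be illustrated with simple linear interpolation of a timestamped measurement stream at a common query time. This is only a sketch: AFT-VO handles asynchrony with a learned transformer, and the scalar measurements below are a hypothetical simplification.

```python
import bisect

def interpolate(stream, t):
    """Linearly interpolate an asynchronous (timestamp, value) stream at
    time t. `stream` must be sorted by timestamp; values are scalars here
    for brevity, and queries outside the stream clamp to the endpoints."""
    times = [ts for ts, _ in stream]
    k = bisect.bisect_left(times, t)
    if k == 0:
        return stream[0][1]
    if k == len(stream):
        return stream[-1][1]
    (t0, v0), (t1, v1) = stream[k - 1], stream[k]
    alpha = (t - t0) / (t1 - t0)
    return v0 + alpha * (v1 - v0)
```

Resampling every camera's stream at shared timestamps in this way would give time-aligned inputs that a fusion model can combine directly.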
arXiv Detail & Related papers (2022-06-26T19:29:08Z) - HydraFusion: Context-Aware Selective Sensor Fusion for Robust and Efficient Autonomous Vehicle Perception [9.975955132759385]
Techniques to fuse sensor data from camera, radar, and lidar sensors have been proposed to improve autonomous vehicle (AV) perception.
Existing methods are insufficiently robust in difficult driving contexts due to rigidity in their fusion implementations.
We propose HydraFusion: a selective sensor fusion framework that learns to identify the current driving context and fuses the best combination of sensors.
arXiv Detail & Related papers (2022-01-17T22:19:53Z) - Multimodal Object Detection via Bayesian Fusion [59.31437166291557]
We study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination.
Our key contribution is a non-learned late-fusion method that fuses together bounding box detections from different modalities.
We apply our approach to benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal sensor data.
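A non-learned late fusion of per-modality confidences can be sketched as a log-odds combination rule, assuming the modalities give conditionally independent posteriors for the same matched object. The exact rule in the paper may differ; this is an illustrative probabilistic-fusion sketch, not the authors' formula.

```python
import math

def fuse_scores(scores, prior=0.5):
    """Fuse per-modality detection confidences for one matched object.

    Treats each score as a posterior p(object | modality) and combines
    the modalities' evidence relative to a shared prior in log-odds space."""
    logit = lambda p: math.log(p / (1.0 - p))
    prior_logit = logit(prior)
    fused_logit = prior_logit + sum(logit(p) - prior_logit for p in scores)
    return 1.0 / (1.0 + math.exp(-fused_logit))
```

Under this rule, two modalities that agree reinforce each other (two scores of 0.8 fuse to above 0.9), while a weak thermal score tempers a confident RGB one, which is the qualitative behavior late fusion is meant to provide under poor illumination.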
arXiv Detail & Related papers (2021-04-07T04:03:20Z) - Deep Soft Procrustes for Markerless Volumetric Sensor Alignment [81.13055566952221]
In this work, we improve markerless data-driven correspondence estimation to achieve more robust multi-sensor spatial alignment.
We incorporate geometric constraints in an end-to-end manner into a typical segmentation based model and bridge the intermediate dense classification task with the targeted pose estimation one.
Our model is experimentally shown to achieve similar results with marker-based methods and outperform the markerless ones, while also being robust to the pose variations of the calibration structure.
arXiv Detail & Related papers (2020-03-23T10:51:32Z) - Learning Selective Sensor Fusion for States Estimation [47.76590539558037]
We propose SelectFusion, an end-to-end selective sensor fusion module.
During prediction, the network is able to assess the reliability of the latent features from different sensor modalities.
We extensively evaluate all fusion strategies in both public datasets and on progressively degraded datasets.
arXiv Detail & Related papers (2019-12-30T20:25:16Z)
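The reliability-based selection described for SelectFusion above can be illustrated with a hard top-k stand-in for the learned soft masks: modalities judged unreliable are simply zeroed out before fusion. The modality names and reliability scores below are hypothetical.

```python
def select_features(features, reliability, keep=2):
    """Keep only the `keep` most reliable modalities' latent features,
    zeroing the rest (a hard-selection stand-in for learned soft masks)."""
    ranked = sorted(reliability, key=reliability.get, reverse=True)
    kept = set(ranked[:keep])
    return {m: (f if m in kept else [0.0] * len(f))
            for m, f in features.items()}
```

On a progressively degraded input, the reliability score of the degraded sensor would drop, and its features would be masked out rather than corrupting the fused state estimate.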
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.