MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection
- URL: http://arxiv.org/abs/2602.08126v2
- Date: Thu, 12 Feb 2026 00:54:39 GMT
- Title: MambaFusion: Adaptive State-Space Fusion for Multimodal 3D Object Detection
- Authors: Venkatraman Narayanan, Bala Sai, Rahul Ahuja, Pratik Likhar, Varun Ravi Kumar, Senthil Yogamani
- Abstract summary: MambaFusion is a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. A structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
- Score: 6.350460753267439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable 3D object detection is fundamental to autonomous driving, yet multimodal fusion of camera and LiDAR data remains a persistent challenge. Cameras provide dense visual cues but ill-posed depth; LiDAR provides precise 3D structure but sparse coverage. Existing BEV-based fusion frameworks have made good progress, but they still suffer from inefficient context modeling, spatially invariant fusion, and weak reasoning under uncertainty. We introduce MambaFusion, a unified multi-modal detection framework that achieves efficient, adaptive, and physically grounded 3D perception. MambaFusion interleaves selective state-space models (SSMs) with windowed transformers to propagate global context in linear time while preserving local geometric fidelity. A multi-modal token alignment (MTA) module and reliability-aware fusion gates dynamically re-weight camera-LiDAR features based on spatial confidence and calibration consistency. Finally, a structure-conditioned diffusion head integrates graph-based reasoning with uncertainty-aware denoising, enforcing physical plausibility and calibrated confidence. MambaFusion establishes new state-of-the-art performance on nuScenes benchmarks while operating with linear-time complexity. The framework demonstrates that coupling SSM-based efficiency with reliability-driven fusion yields robust, temporally stable, and interpretable 3D perception for real-world autonomous driving systems.
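As a rough illustration of the reliability-aware fusion gates mentioned in the abstract, the sketch below blends camera and LiDAR BEV features with a spatially varying gate predicted from the two feature maps plus per-sensor confidence maps. The layer sizes, inputs, and gating form are assumptions made for illustration; the paper's MTA module and calibration-consistency terms are not reproduced here.

```python
import torch
import torch.nn as nn

class ReliabilityGate(nn.Module):
    # A minimal, hypothetical reliability-aware fusion gate: predict a per-location
    # weight in [0, 1] from both BEV feature maps and per-sensor confidence maps,
    # then blend the camera and LiDAR features with that weight.
    def __init__(self, channels: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels + 2, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, cam_bev, lidar_bev, cam_conf, lidar_conf):
        # cam_bev, lidar_bev: (B, C, H, W); cam_conf, lidar_conf: (B, 1, H, W) in [0, 1].
        g = self.gate(torch.cat([cam_bev, lidar_bev, cam_conf, lidar_conf], dim=1))
        return g * cam_bev + (1.0 - g) * lidar_bev  # spatially varying camera/LiDAR blend
```

Usage would be along the lines of `fused = ReliabilityGate(128)(cam, lidar, cam_conf, lidar_conf)` for 128-channel BEV maps; the confidence maps could come from any per-sensor reliability estimate.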
Related papers
- VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction [0.0]
VLMFusionOcc3D is a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. We introduce Weather-Aware Adaptive Fusion, a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions. Our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
arXiv Detail & Related papers (2026-03-03T05:22:28Z)
- LiteFusion: Taming 3D Object Detectors from Vision-Based to Multi-Modal with Minimal Adaptation [23.72983078807998]
Current 3D object detectors rely on complex architectures and training strategies to achieve higher detection accuracy. These methods depend heavily on the LiDAR sensor, so they suffer large performance drops when LiDAR is absent. We introduce a novel multi-modal 3D detector, LiteFusion, which integrates complementary features from LiDAR points into image features within a quaternion space.
arXiv Detail & Related papers (2025-12-23T10:16:33Z) - Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context (a patchification sketch follows this list). Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z)
- GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving [5.450011907283289]
This paper introduces GMF-Drive, an end-to-end framework that overcomes these challenges through two principled innovations. First, we supersede the information-limited histogram-based LiDAR representation with a geometrically-augmented pillar format (a pillar-statistics sketch follows this list). Second, we propose a novel hierarchical mamba fusion architecture that substitutes an expensive transformer with a highly efficient, spatially-aware state-space model.
arXiv Detail & Related papers (2025-08-08T08:17:18Z)
- Reliability-Driven LiDAR-Camera Fusion for Robust 3D Object Detection [0.0]
We propose ReliFusion, a LiDAR-camera fusion framework operating in the bird's-eye view (BEV) space. ReliFusion integrates three key components: the Spatio-Temporal Feature Aggregation (STFA) module, the Reliability module, and the Confidence-Weighted Mutual Cross-Attention (CW-MCA) module (a confidence-weighted attention sketch follows this list). Experiments on the nuScenes dataset show that ReliFusion significantly outperforms state-of-the-art methods, achieving superior robustness and accuracy in scenarios with limited LiDAR fields of view and severe sensor malfunctions.
arXiv Detail & Related papers (2025-02-03T22:07:14Z)
- GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency [50.11520458252128]
Existing 3D affordance learning methods struggle with generalization and robustness due to limited annotated data. We propose GEAL, a novel framework designed to enhance the generalization and robustness of 3D affordance learning by leveraging large-scale pre-trained 2D models. GEAL consistently outperforms existing methods across seen and novel object categories, as well as corrupted data.
arXiv Detail & Related papers (2024-12-12T17:59:03Z)
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities into either the Bird's Eye View (BEV) or the Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be effectively performed by combining their ROI features (a box-matching sketch follows this list).
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features by lifting points in 2D camera images into 3D space for fusion (a multi-depth lifting sketch follows this list).
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z)
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection [56.03081616213012]
We propose EPNet++ for multi-modal 3D object detection by introducing a novel Cascade Bi-directional Fusion (CB-Fusion) module.
The proposed CB-Fusion module enriches point features with the plentiful semantic information of image features in a cascade bi-directional interaction fusion manner (a point-to-image sampling sketch follows this list).
The experiment results on the KITTI, JRDB and SUN-RGBD datasets demonstrate the superiority of EPNet++ over the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-21T10:48:34Z)
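For the Lemon entry above, here is a minimal sketch of one way to patchify a point cloud into local tokens, using farthest point sampling and k-nearest-neighbour grouping. The paper's actual structured patchification scheme is not specified here, so all function names and sizes are assumptions.

```python
import torch

def farthest_point_sampling(xyz: torch.Tensor, m: int) -> torch.Tensor:
    # Greedy FPS over an (N, 3) cloud, returning m centre indices.
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    idx[0] = torch.randint(n, (1,)).item()
    for i in range(1, m):
        dist = torch.minimum(dist, (xyz - xyz[idx[i - 1]]).norm(dim=1))
        idx[i] = dist.argmax()
    return idx

def patchify(xyz: torch.Tensor, num_patches: int = 64, k: int = 32):
    # Group the cloud into num_patches local patches of k nearest neighbours
    # around FPS centres; each patch can then be embedded into one token.
    centres = xyz[farthest_point_sampling(xyz, num_patches)]          # (P, 3)
    knn = torch.cdist(centres, xyz).topk(k, largest=False).indices    # (P, k)
    patches = xyz[knn] - centres[:, None, :]                          # centre-relative coords
    return centres, patches                                           # (P, 3), (P, k, 3)
```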
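For the GMF-Drive entry, a rough sketch of what a geometrically-augmented pillar representation could look like: per-pillar statistics scattered onto a BEV grid instead of a histogram. The specific statistics, grid size, and point-cloud range are illustrative assumptions, not the paper's design.

```python
import torch

def pillar_stats(points: torch.Tensor, grid=(200, 200), pc_range=(-50.0, -50.0, 50.0, 50.0)):
    # points: (N, 4) columns [x, y, z, intensity].
    # Returns a (gx, gy, 4) BEV map of per-pillar statistics:
    # point count, mean height, height variance, mean intensity.
    x0, y0, x1, y1 = pc_range
    gx, gy = grid
    ix = ((points[:, 0] - x0) / (x1 - x0) * gx).long().clamp(0, gx - 1)
    iy = ((points[:, 1] - y0) / (y1 - y0) * gy).long().clamp(0, gy - 1)
    flat = ix * gy + iy                                   # flattened pillar index per point
    ones = torch.ones(points.shape[0])
    count = torch.zeros(gx * gy).index_add_(0, flat, ones)
    z_sum = torch.zeros(gx * gy).index_add_(0, flat, points[:, 2])
    z_sq = torch.zeros(gx * gy).index_add_(0, flat, points[:, 2] ** 2)
    i_sum = torch.zeros(gx * gy).index_add_(0, flat, points[:, 3])
    n = count.clamp(min=1.0)
    z_mean = z_sum / n
    stats = torch.stack([count, z_mean, z_sq / n - z_mean ** 2, i_sum / n], dim=-1)
    return stats.view(gx, gy, 4)
```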
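For the ReliFusion entry, a minimal sketch of one plausible form of confidence-weighted cross-attention: tokens from one modality attend over the other modality's tokens, with attention logits biased by per-token reliability. The additive log-confidence bias and all shapes are assumptions; the paper's CW-MCA module may differ.

```python
import torch

def confidence_weighted_cross_attention(q: torch.Tensor, kv: torch.Tensor, kv_conf: torch.Tensor):
    # q: (Nq, C) query tokens from one modality.
    # kv: (Nkv, C) key/value tokens from the other modality.
    # kv_conf: (Nkv,) per-token reliability scores in (0, 1].
    scale = q.shape[-1] ** -0.5
    logits = (q @ kv.T) * scale                   # (Nq, Nkv) dot-product attention logits
    logits = logits + torch.log(kv_conf + 1e-6)   # unreliable tokens get suppressed
    attn = logits.softmax(dim=-1)
    return attn @ kv                              # (Nq, C) fused features
```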
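For the box-matching entry (FBMNet), a simple stand-in for assigning 2D proposals to 3D proposals: project the 3D boxes into the image, score pairs by IoU, and solve a Hungarian assignment. The paper learns its assignments, so this hand-crafted IoU matcher only illustrates the general idea.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes(boxes_2d: np.ndarray, boxes_3d_proj: np.ndarray):
    # Both inputs are (N, 4) / (M, 4) image-plane boxes [x1, y1, x2, y2];
    # boxes_3d_proj are 3D proposals already projected into the image.
    lt = np.maximum(boxes_2d[:, None, :2], boxes_3d_proj[None, :, :2])
    rb = np.minimum(boxes_2d[:, None, 2:], boxes_3d_proj[None, :, 2:])
    wh = np.clip(rb - lt, 0.0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (boxes_2d[:, 2] - boxes_2d[:, 0]) * (boxes_2d[:, 3] - boxes_2d[:, 1])
    area_b = (boxes_3d_proj[:, 2] - boxes_3d_proj[:, 0]) * (boxes_3d_proj[:, 3] - boxes_3d_proj[:, 1])
    iou = inter / (area_a[:, None] + area_b[None, :] - inter + 1e-6)
    rows, cols = linear_sum_assignment(-iou)        # maximise total IoU
    return list(zip(rows.tolist(), cols.tolist()))  # (2D index, 3D index) pairs
```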
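For the MSMDFusion entry, a small sketch of lifting image pixels into 3D at several candidate depths ("multi-depth seeds"); the depth hypotheses and naming are illustrative, not the paper's actual procedure.

```python
import torch

def lift_pixels_multi_depth(uv: torch.Tensor, K: torch.Tensor,
                            depths=(1.0, 5.0, 10.0, 20.0, 40.0)) -> torch.Tensor:
    # uv: (N, 2) pixel coordinates; K: (3, 3) camera intrinsics.
    # Back-project each pixel along its camera ray at D candidate depths.
    ones = torch.ones(uv.shape[0], 1)
    rays = (torch.linalg.inv(K) @ torch.cat([uv, ones], dim=1).T).T   # (N, 3) unit-depth rays
    d = torch.tensor(depths).view(1, -1, 1)                           # (1, D, 1)
    return rays[:, None, :] * d                                       # (N, D, 3) seed points
```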
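For the EPNet++ entry, one direction of a bi-directional point-image fusion sketched in isolation: project LiDAR points into the image and bilinearly sample per-point image features. The reverse image-to-point direction and the cascade are omitted, and all names and conventions here are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_image_features_at_points(points_cam: torch.Tensor, img_feats: torch.Tensor,
                                    K: torch.Tensor, img_size: tuple) -> torch.Tensor:
    # points_cam: (N, 3) points in camera coordinates (z > 0).
    # img_feats: (1, C, H, W) image feature map; K: (3, 3) intrinsics; img_size: (W, H).
    uvw = (K @ points_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)            # perspective projection to pixels
    w, h = img_size
    grid = torch.stack([uv[:, 0] / (w - 1) * 2 - 1,          # normalise to [-1, 1] for grid_sample
                        uv[:, 1] / (h - 1) * 2 - 1], dim=-1)
    grid = grid.view(1, 1, -1, 2)
    feats = F.grid_sample(img_feats, grid, align_corners=True)  # (1, C, 1, N)
    return feats[0, :, 0].T                                   # (N, C) per-point image features
```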