Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object Detection
- URL: http://arxiv.org/abs/2503.08992v2
- Date: Mon, 17 Mar 2025 15:33:08 GMT
- Title: Dual-Domain Homogeneous Fusion with Cross-Modal Mamba and Progressive Decoder for 3D Object Detection
- Authors: Xuzhong Hu, Zaipeng Duan, Pei An, Jun Zhang, Jie Ma
- Abstract summary: Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving. However, this paradigm is constrained by excessive feature compression. We propose a Dual-Domain Homogeneous Fusion network (DDHFusion) to overcome these limitations.
- Score: 12.77616717954945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fusing LiDAR and image features in a homogeneous BEV domain has become popular for 3D object detection in autonomous driving. However, this paradigm is constrained by excessive feature compression. While some works explore dense voxel fusion to enable better feature interaction, they face high computational costs and challenges in query generation. Additionally, feature misalignment in both domains results in suboptimal detection accuracy. To address these limitations, we propose a Dual-Domain Homogeneous Fusion network (DDHFusion), which leverages the complementarity of the BEV and voxel domains while mitigating their drawbacks. Specifically, we first transform image features into BEV and sparse voxel representations using lift-splat-shoot and our proposed Semantic-Aware Feature Sampling (SAFS) module. The latter significantly reduces computational overhead by discarding unimportant voxels. Next, we introduce Homogeneous Voxel and BEV Fusion (HVF and HBF) networks for multi-modal fusion within the respective domains. They are equipped with novel cross-modal Mamba blocks to resolve feature misalignment and enable comprehensive scene perception. The output voxel features are injected into the BEV space to compensate for the information loss caused by direct height compression. During query selection, the Progressive Query Generation (PQG) mechanism is applied in the BEV domain to reduce false negatives caused by feature compression. Furthermore, we propose a Progressive Decoder (PD) that sequentially aggregates not only context-rich BEV features but also geometry-aware voxel features, using deformable attention and the Multi-Modal Voxel Feature Mixing (MMVFM) block for precise classification and box regression.
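The SAFS step above is concrete enough to sketch. The following is a minimal, hedged PyTorch illustration of the idea — score image features with a lightweight semantic head, keep only the top-scoring locations, and lift those into a sparse voxel set. The scoring head, keep ratio, and depth-bin lifting are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of the Semantic-Aware Feature Sampling (SAFS) idea from the
# abstract: score image features semantically, keep only the top-k locations,
# and lift those into a sparse voxel set. The scoring head, keep ratio, and
# depth-bin lifting are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class SemanticAwareSampler(nn.Module):
    def __init__(self, channels, keep_ratio=0.25, depth_bins=8):
        super().__init__()
        self.score_head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel semantic score
        self.keep_ratio = keep_ratio
        self.depth_bins = depth_bins

    def forward(self, feats):
        # feats: (B, C, H, W) image features
        b, c, h, w = feats.shape
        scores = self.score_head(feats).sigmoid().flatten(1)      # (B, H*W)
        k = max(1, int(self.keep_ratio * h * w))
        top_scores, top_idx = scores.topk(k, dim=1)               # keep important pixels
        flat = feats.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        kept = torch.gather(flat, 1, top_idx.unsqueeze(-1).expand(-1, -1, c))
        # Lift each kept pixel into D depth bins -> sparse "voxel" features.
        voxels = kept.unsqueeze(2).expand(-1, -1, self.depth_bins, -1)  # (B, k, D, C)
        return voxels.reshape(b, -1, c), top_idx, top_scores

feats = torch.randn(2, 64, 32, 32)
voxels, idx, conf = SemanticAwareSampler(64)(feats)
print(voxels.shape)  # torch.Size([2, 2048, 64]) with keep_ratio=0.25, D=8
```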
Related papers
- SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection [12.941263635455915]
Most previous 3D object detection methods utilize the Bird's Eye View (BEV) space for intermediate feature representation. This paper focuses on the sparse nature of LiDAR point cloud data. We introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed SparseVoxFormer.
arXiv Detail & Related papers (2025-03-11T06:52:25Z)
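A minimal sketch of the sparse-token idea the SparseVoxFormer abstract describes: treat only non-empty voxels as tokens and run standard multi-head self-attention over them. The occupancy test, sizes, and single-batch layout are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch: attention over occupied voxels only, exploiting the sparse
# nature of LiDAR data instead of a dense BEV intermediate representation.
import torch
import torch.nn as nn

def sparse_voxel_attention(grid, attn):
    # grid: (C, X, Y, Z) dense voxel features; most voxels are empty (all-zero).
    flat = grid.flatten(1).T                       # (X*Y*Z, C)
    occupied = flat.abs().sum(dim=1) > 0           # mask of non-empty voxels
    tokens = flat[occupied].unsqueeze(0)           # (1, N_occ, C) -- sparse tokens
    out, _ = attn(tokens, tokens, tokens)          # attention over occupied voxels only
    return out, occupied

grid = torch.zeros(32, 16, 16, 8)
grid[:, :3, :3, :2] = torch.randn(32, 3, 3, 2)     # a few occupied voxels
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
out, mask = sparse_voxel_attention(grid, attn)
print(out.shape, int(mask.sum()))                  # torch.Size([1, 18, 32]) 18
```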
- V2X-DGPE: Addressing Domain Gaps and Pose Errors for Robust Collaborative 3D Object Detection [18.694510415777632]
V2X-DGPE is a high-accuracy and robust V2X feature-level collaborative perception framework.
The proposed method outperforms existing approaches, achieving state-of-the-art detection performance.
arXiv Detail & Related papers (2025-01-04T19:28:55Z)
- A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z)
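Both TransMamba above and the cross-modal Mamba blocks in the main paper build on selective state-space models. The sketch below shows only the core selective recurrence h_t = a_t * h_{t-1} + b_t * x_t with input-dependent gates; real Mamba layers use a discretized SSM with a hardware-efficient parallel scan, so this explicit loop is purely illustrative.

```python
# Hedged sketch of a selective state-space recurrence: input-dependent gates
# decide how much history to keep and how much input to admit at each step.
import torch
import torch.nn as nn

class TinySelectiveScan(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate_a = nn.Linear(dim, dim)  # input-dependent decay
        self.gate_b = nn.Linear(dim, dim)  # input-dependent input gate

    def forward(self, x):
        # x: (B, T, D) token sequence; returns the hidden state at every step.
        b, t, d = x.shape
        h = x.new_zeros(b, d)
        outs = []
        for step in range(t):
            xt = x[:, step]
            a = torch.sigmoid(self.gate_a(xt))       # how much history to keep
            g = torch.sigmoid(self.gate_b(xt))       # how much input to admit
            h = a * h + g * xt                       # selective recurrence
            outs.append(h)
        return torch.stack(outs, dim=1)              # (B, T, D)

y = TinySelectiveScan(16)(torch.randn(2, 10, 16))
print(y.shape)  # torch.Size([2, 10, 16])
```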
- HEAD: A Bandwidth-Efficient Cooperative Perception Approach for Heterogeneous Connected and Autonomous Vehicles [9.10239345027499]
HEAD is a method that fuses features from the classification and regression heads in 3D object detection networks.
Our experiments demonstrate that HEAD effectively balances communication bandwidth and perception performance.
arXiv Detail & Related papers (2024-08-27T22:05:44Z)
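A hedged sketch of the bandwidth argument in the HEAD abstract: classification and regression head outputs are far smaller than backbone BEV maps, so agents can exchange and fuse those instead. The max/mean fusion rules and tensor shapes below are assumptions for illustration.

```python
# Hedged sketch: fuse per-agent detection-head outputs rather than raw
# backbone features, trading a little accuracy for much less bandwidth.
import torch

def fuse_head_outputs(cls_maps, reg_maps):
    # cls_maps: list of per-agent (num_classes, H, W) heatmaps.
    # reg_maps: list of per-agent (reg_dim, H, W) box-parameter maps.
    fused_cls = torch.stack(cls_maps).max(dim=0).values  # confidence-style max fuse
    fused_reg = torch.stack(reg_maps).mean(dim=0)        # average box parameters
    return fused_cls, fused_reg

cls_a, cls_b = torch.rand(3, 64, 64), torch.rand(3, 64, 64)
reg_a, reg_b = torch.randn(8, 64, 64), torch.randn(8, 64, 64)
fc, fr = fuse_head_outputs([cls_a, cls_b], [reg_a, reg_b])
print(fc.shape, fr.shape)  # torch.Size([3, 64, 64]) torch.Size([8, 64, 64])
```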
- MR3D-Net: Dynamic Multi-Resolution 3D Sparse Voxel Grid Fusion for LiDAR-Based Collective Perception [0.5714074111744111]
We propose MR3D-Net, a dynamic multi-resolution 3D sparse voxel grid fusion backbone architecture for LiDAR-based collective perception.
We show that sparse voxel grids at varying resolutions provide a meaningful and compact environment representation that can adapt to the communication bandwidth.
arXiv Detail & Related papers (2024-08-12T13:27:11Z)
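The MR3D-Net abstract's claim that sparse voxel grids can adapt to communication bandwidth can be sketched directly: voxelize the same point cloud at several resolutions and transmit the finest grid that fits the budget. The voxel sizes and budget rule below are assumptions.

```python
# Hedged sketch: multi-resolution voxel grids as a bandwidth-adaptive,
# compact environment representation.
import torch

def voxelize(points, voxel_size):
    # points: (N, 3) -> unique voxel coordinates (a compact, lossy summary).
    coords = torch.floor(points / voxel_size).long()
    return torch.unique(coords, dim=0)

def pick_resolution(points, budget_voxels, sizes=(0.1, 0.2, 0.4, 0.8)):
    # Try fine-to-coarse; send the finest grid that fits the link budget.
    for s in sizes:
        grid = voxelize(points, s)
        if grid.shape[0] <= budget_voxels:
            return s, grid
    return sizes[-1], voxelize(points, sizes[-1])

pts = torch.rand(5000, 3) * 40.0
size, grid = pick_resolution(pts, budget_voxels=2000)
print(size, grid.shape)
```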
- BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection [10.321117046185321]
A new trend is to fuse multi-modal inputs, i.e., LiDAR and camera. LiDAR features struggle with detailed semantic information, while camera features lack accurate 3D spatial information. BiCo-Fusion can achieve robust semantic- and spatial-aware 3D object detection.
arXiv Detail & Related papers (2024-06-27T09:56:38Z)
- Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition [57.74076383449153]
We propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++.
It models two common event representations simultaneously, i.e., event images and event voxels.
We achieve new state-of-the-art performance on the Bullying10k dataset, i.e., 90.51%, which exceeds the second place by +2.21%.
arXiv Detail & Related papers (2024-06-27T02:32:46Z)
- Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection [26.75994759483174]
Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space.
Previous methods struggle to generate fused BEV features free from cross-modal conflicts.
We propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space.
arXiv Detail & Related papers (2024-03-12T07:16:20Z)
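A hedged sketch of the conflict-elimination idea in the ECFusion abstract: where LiDAR and camera BEV features disagree strongly, gate the fused response rather than letting one modality silently dominate. The cosine-similarity agreement measure and gating scheme are illustrative assumptions, not the paper's mechanism.

```python
# Hedged sketch: down-weight BEV cells where the two modalities conflict.
import torch
import torch.nn.functional as F

def conflict_aware_fuse(bev_lidar, bev_cam):
    # bev_*: (C, H, W) single-modality BEV features.
    agreement = F.cosine_similarity(bev_lidar, bev_cam, dim=0)  # (H, W), in [-1, 1]
    gate = agreement.clamp(min=0.0).unsqueeze(0)                # suppress conflicting cells
    fused = 0.5 * (bev_lidar + bev_cam)
    # Where modalities agree, average; where they conflict, fall back to the
    # stronger single-modality response instead of a muddled mixture.
    return gate * fused + (1 - gate) * torch.maximum(bev_lidar, bev_cam)

out = conflict_aware_fuse(torch.randn(64, 128, 128), torch.randn(64, 128, 128))
print(out.shape)  # torch.Size([64, 128, 128])
```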
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- DI-V2X: Learning Domain-Invariant Representation for Vehicle-Infrastructure Collaborative 3D Object Detection [78.09431523221458]
DI-V2X aims to learn Domain-Invariant representations through a new distillation framework.
DI-V2X comprises three essential components: a domain-mixing instance augmentation (DMA) module, a progressive domain-invariant distillation (PDD) module, and a domain-adaptive fusion (DAF) module.
arXiv Detail & Related papers (2023-12-25T14:40:46Z)
- Mutual-Guided Dynamic Network for Image Fusion [51.615598671899335]
We propose a novel mutual-guided dynamic network (MGDN) for image fusion, which allows for effective information utilization across different locations and inputs.
Experimental results on five benchmark datasets demonstrate that our proposed method outperforms existing methods on four image fusion tasks.
arXiv Detail & Related papers (2023-08-24T03:50:37Z)
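The mutual guidance in the MGDN abstract can be illustrated with per-pixel dynamic filtering: one input predicts spatially varying kernels that are applied to the other, so information flows across inputs and locations. The kernel size and softmax normalization below are assumptions.

```python
# Hedged sketch of guided dynamic filtering: the guide image decides, per
# pixel, how the target image's local neighborhood is aggregated.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedDynamicFilter(nn.Module):
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(channels, k * k, kernel_size=1)  # per-pixel kernel

    def forward(self, guide, target):
        # guide/target: (B, C, H, W); guide decides how target is filtered.
        b, c, h, w = target.shape
        kernels = self.pred(guide).softmax(dim=1)                   # (B, k*k, H, W)
        patches = F.unfold(target, self.k, padding=self.k // 2)     # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h * w)
        out = (patches * kernels.view(b, 1, -1, h * w)).sum(dim=2)  # weighted sum
        return out.view(b, c, h, w)

f = GuidedDynamicFilter(16)
y = f(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 32, 32))
print(y.shape)  # torch.Size([1, 16, 32, 32])
```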
- X-Align++: cross-modal cross-view alignment for Bird's-eye-view segmentation [44.58686493878629]
X-Align is a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation.
X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on the nuScenes and KITTI-360 datasets.
arXiv Detail & Related papers (2023-06-06T15:52:55Z)
- Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection [49.324070632356296]
We develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively.
Our efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors.
arXiv Detail & Related papers (2023-04-06T05:00:58Z)
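A minimal sketch of the dual representation in the voxel-pillar abstract above: scatter the same points into 3D voxels and 2D pillars, which the real method then processes with 3D and 2D sparse convolutions. Scatter-mean stands in for the sparse-conv encoders here, and the grid sizes are assumptions.

```python
# Hedged sketch: the same point cloud encoded as 3D voxels (fine geometry)
# and 2D pillars (cheap BEV view) via scatter-mean pooling.
import torch

def scatter_mean(points, cell_ids, num_cells):
    # Average point features per cell: (N, F), (N,) -> (num_cells, F).
    feat_sum = torch.zeros(num_cells, points.shape[1]).index_add_(0, cell_ids, points)
    counts = torch.zeros(num_cells).index_add_(0, cell_ids, torch.ones(len(points)))
    return feat_sum / counts.clamp(min=1).unsqueeze(1)

pts = torch.rand(1000, 3) * 10.0                        # x, y, z in [0, 10)
vox = torch.floor(pts / 0.5).long()                     # 20x20x20 voxel grid
pil = torch.floor(pts[:, :2] / 0.5).long()              # 20x20 pillars (z dropped)
vox_ids = (vox[:, 0] * 20 + vox[:, 1]) * 20 + vox[:, 2]
pil_ids = pil[:, 0] * 20 + pil[:, 1]
voxel_feats = scatter_mean(pts, vox_ids, 20 * 20 * 20)
pillar_feats = scatter_mean(pts, pil_ids, 20 * 20)
print(voxel_feats.shape, pillar_feats.shape)
```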
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
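The UVTR abstract's unification of modalities in voxel space can be sketched as projecting voxel centers into the image and sampling camera features there, so both modalities populate the same grid. The pinhole projection, single camera, and grid layout below are illustrative assumptions.

```python
# Hedged sketch: lift image features into a shared voxel grid by projecting
# voxel centers through a pinhole camera and bilinearly sampling features.
import torch
import torch.nn.functional as F

def image_to_voxel(img_feat, centers, K):
    # img_feat: (1, C, H, W); centers: (N, 3) voxel centers in camera coords; K: (3, 3).
    uvw = centers @ K.T                                   # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-5)          # (N, 2) pixel coords
    h, w = img_feat.shape[-2:]
    # Normalize pixel coords to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(img_feat, grid.view(1, 1, -1, 2), align_corners=True)
    return sampled.squeeze(2).squeeze(0).T                # (N, C) per-voxel features

K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
centers = torch.rand(500, 3) * torch.tensor([4.0, 4.0, 8.0]) + torch.tensor([-2.0, -2.0, 2.0])
feats = image_to_voxel(torch.randn(1, 32, 64, 64), centers, K)
print(feats.shape)  # torch.Size([500, 32])
```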
- Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open Dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or summaries and is not responsible for any consequences of their use.