Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D
Object Detection
- URL: http://arxiv.org/abs/2403.07372v1
- Date: Tue, 12 Mar 2024 07:16:20 GMT
- Authors: Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng
Mu, Si Liu
- Abstract summary: Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space.
Previous methods have limitations in generating fusion BEV features free from cross-modal conflicts.
We propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space.
- Score: 26.75994759483174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent 3D object detectors typically utilize multi-sensor data and unify
multi-modal features in the shared bird's-eye view (BEV) representation space.
However, our empirical findings indicate that previous methods have limitations
in generating fusion BEV features free from cross-modal conflicts. These
conflicts encompass extrinsic conflicts caused by BEV feature construction and
inherent conflicts stemming from heterogeneous sensor signals. Therefore, we
propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly
eliminate the extrinsic/inherent conflicts in BEV space and produce improved
multi-modal BEV features. Specifically, we devise a Semantic-guided Flow-based
Alignment (SFA) module to resolve extrinsic conflicts via unifying spatial
distribution in BEV space before fusion. Moreover, we design a Dissolved Query
Recovering (DQR) mechanism to remedy inherent conflicts by preserving
objectness clues that are lost in the fusion BEV feature. In general, our
method maximizes the effective information utilization of each modality and
leverages inter-modal complementarity. Our method achieves state-of-the-art
performance on the highly competitive nuScenes 3D object detection dataset. The
code is released at https://github.com/fjhzhixi/ECFusion.
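The abstract describes a two-stage idea: align the camera BEV feature to the LiDAR BEV feature before fusion (the SFA module), then recover objectness clues that are strong in a single modality but dissolved in the fused feature (the DQR mechanism). The toy sketch below illustrates only that high-level flow with NumPy arrays; the flow field, fusion operator, shapes, and threshold are all simplified stand-ins invented for illustration, not the authors' implementation.

```python
import numpy as np

# Toy stand-ins for the two stages described in the abstract:
# (1) warp the camera BEV feature toward the LiDAR BEV feature before
#     fusion (simplified stand-in for Semantic-guided Flow-based
#     Alignment, SFA), and
# (2) flag BEV cells that are strong in one modality but weak after
#     fusion, so their objectness clues can be recovered (simplified
#     stand-in for Dissolved Query Recovering, DQR).

H, W, C = 8, 8, 4
rng = np.random.default_rng(0)
lidar_bev = rng.normal(size=(H, W, C))
camera_bev = rng.normal(size=(H, W, C))

def align(cam, flow):
    """Warp the camera BEV by an integer flow field (toy SFA stand-in)."""
    out = np.zeros_like(cam)
    for y in range(H):
        for x in range(W):
            dy, dx = flow[y, x]
            out[y, x] = cam[(y + dy) % H, (x + dx) % W]
    return out

flow = np.zeros((H, W, 2), dtype=int)       # identity flow for the demo
fused = 0.5 * lidar_bev + 0.5 * align(camera_bev, flow)

# Toy "heatmaps": channel-max per BEV cell for each modality and the fusion.
heat = {name: feat.max(axis=-1) for name, feat in
        {"lidar": lidar_bev, "camera": camera_bev, "fused": fused}.items()}

# Toy DQR stand-in: cells confident in a single modality but not after fusion.
thresh = 1.5  # arbitrary demo threshold
recovered = ((heat["lidar"] > thresh) | (heat["camera"] > thresh)) \
            & (heat["fused"] <= thresh)

print(fused.shape)  # (8, 8, 4)
```

With an identity flow the fusion here reduces to a plain average; the paper's point is precisely that such naive fusion lets one modality's objectness evidence dissolve, which the recovery mask above crudely detects.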
Related papers
- ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection [21.05923528672353]
We propose a novel ContrastAlign approach to enhance the alignment of heterogeneous modalities.
Our method achieves state-of-the-art performance, with an mAP of 70.3%, surpassing BEVFusion by 1.8% on the nuScenes validation set.
arXiv Detail & Related papers (2024-05-27T06:43:12Z)
- IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection [130.394884412296]
We propose IS-Fusion, an innovative multimodal fusion framework.
It captures the Instance- and Scene-level contextual information.
IS-Fusion essentially differs from existing approaches that focus only on scene-level BEV fusion.
arXiv Detail & Related papers (2024-03-22T14:34:17Z)
- UniMODE: Unified Monocular 3D Object Detection [70.27631528933482]
We build a detector based on the bird's-eye-view (BEV) detection paradigm.
We propose an uneven BEV grid design to handle the convergence instability caused by the challenges.
A unified detector UniMODE is derived, which surpasses the previous state-of-the-art on the challenging Omni3D dataset.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird's-Eye-View (BEV) is one of the most widely used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection [55.48770333927732]
We propose a Diffusion-based Anomaly Detection (DiAD) framework for multi-class anomaly detection.
It consists of a pixel-space autoencoder, a latent-space Semantic-Guided (SG) network with a connection to the stable diffusion's denoising network, and a feature-space pre-trained feature extractor.
Experiments on MVTec-AD and VisA datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-12-11T18:38:28Z)
- UniBEV: Multi-modal 3D Object Detection with Uniform BEV Encoders for Robustness against Missing Sensor Modalities [7.470926069132259]
We propose an end-to-end multi-modal 3D object detection framework designed for robustness against missing modalities.
UniBEV can operate on LiDAR plus camera input, but also on LiDAR-only or camera-only input without retraining.
We compare UniBEV to state-of-the-art BEVFusion and MetaBEV on nuScenes over all sensor input combinations.
arXiv Detail & Related papers (2023-09-25T20:22:47Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be performed effectively by combining their RoI features.
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with a hybrid fusion scheme.
Our model outperforms the state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec 3D-AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
- MBDF-Net: Multi-Branch Deep Fusion Network for 3D Object Detection [17.295359521427073]
We propose a Multi-Branch Deep Fusion Network (MBDF-Net) for 3D object detection.
In the first stage, our multi-branch feature extraction network utilizes Adaptive Attention Fusion modules to produce cross-modal fusion features from single-modal semantic features.
In the second stage, we use a region-of-interest (RoI)-pooled fusion module to generate enhanced local features for refinement.
arXiv Detail & Related papers (2021-08-29T15:40:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.