IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- URL: http://arxiv.org/abs/2403.15241v1
- Date: Fri, 22 Mar 2024 14:34:17 GMT
- Title: IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- Authors: Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang
- Abstract summary: We propose IS-Fusion, an innovative multimodal fusion framework.
It captures the Instance- and Scene-level contextual information.
IS-Fusion essentially differs from existing approaches that only focus on BEV scene-level fusion.
- Score: 130.394884412296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
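The instance-guided step described in the abstract (mine instance candidates, aggregate local multimodal context, then enhance the scene feature) can be illustrated with a toy sketch. This is a hypothetical simplification on a plain 2D BEV activation grid, not the authors' IGF module: the real method uses transformers over multimodal features, while here candidates are simply the top-k strongest cells and the "context" is a local average.

```python
# Toy sketch of instance-guided fusion on a BEV grid (hypothetical, not the
# paper's implementation): pick the top-k strongest cells as instance
# candidates, average each candidate's local neighborhood as its context,
# and add that context back into the scene feature, yielding an
# "instance-aware" BEV map.

def instance_guided_fusion(bev, k=2, radius=1):
    h, w = len(bev), len(bev[0])
    # 1. Mine instance candidates: the k cells with the highest activation.
    cells = sorted(((bev[i][j], i, j) for i in range(h) for j in range(w)),
                   reverse=True)[:k]
    fused = [row[:] for row in bev]
    for _, ci, cj in cells:
        # 2. Aggregate the local context around each candidate.
        neigh = [bev[i][j]
                 for i in range(max(0, ci - radius), min(h, ci + radius + 1))
                 for j in range(max(0, cj - radius), min(w, cj + radius + 1))]
        ctx = sum(neigh) / len(neigh)
        # 3. Enhance the scene feature at the candidate location.
        fused[ci][cj] += ctx
    return fused

bev = [[0.0, 0.1, 0.0],
       [0.1, 0.9, 0.2],
       [0.0, 0.2, 0.8]]
out = instance_guided_fusion(bev, k=2)
```

Only the candidate locations are enhanced; background cells pass through unchanged, which mirrors the paper's motivation that objects occupy a small, sparse part of the BEV scene.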
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion fuses information of RGB images and LiDAR point clouds at the point of interest (abbreviated as PoI)
Our approach prevents information loss caused by view transformation and eliminates the computation-intensive global attention.
Remarkably, our PoIFusion achieves 74.9% NDS and 73.4% mAP, setting a state-of-the-art record on the multi-modal 3D object detection benchmark.
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection [26.75994759483174]
Recent 3D object detectors typically utilize multi-sensor data and unify multi-modal features in the shared bird's-eye view (BEV) representation space.
Previous methods have limitations in generating fusion BEV features free from cross-modal conflicts.
We propose a novel Eliminating Conflicts Fusion (ECFusion) method to explicitly eliminate the extrinsic/inherent conflicts in BEV space.
arXiv Detail & Related papers (2024-03-12T07:16:20Z) - TSJNet: A Multi-modality Target and Semantic Awareness Joint-driven Image Fusion Network [2.7387720378113554]
We introduce a target and semantic awareness-driven fusion network called TSJNet.
It comprises fusion, detection, and segmentation networks arranged in a series structure.
It can generate visually pleasing fused results, achieving average increases of 2.84% in object detection mAP@0.5 and 7.47% in segmentation mIoU.
arXiv Detail & Related papers (2024-02-02T08:37:38Z) - ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion [61.37481051263816]
Given a single image of a 3D object, this paper proposes a method (named ConsistNet) that is able to generate multiple images of the same object.
Our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-10-16T12:29:29Z) - SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection [84.09798649295038]
Given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient.
We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations.
SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones.
arXiv Detail & Related papers (2023-04-27T17:17:39Z) - BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [116.6111047218081]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z) - ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
arXiv Detail & Related papers (2021-12-02T16:09:33Z) - Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
arXiv Detail & Related papers (2021-07-19T16:42:49Z) - CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel.
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2021-06-16T08:18:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.