IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- URL: http://arxiv.org/abs/2403.15241v1
- Date: Fri, 22 Mar 2024 14:34:17 GMT
- Title: IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
- Authors: Junbo Yin, Jianbing Shen, Runnan Chen, Wei Li, Ruigang Yang, Pascal Frossard, Wenguan Wang
- Abstract summary: We propose IS-Fusion, an innovative multimodal fusion framework.
It jointly captures instance- and scene-level contextual information.
IS-Fusion essentially differs from existing approaches that focus only on BEV scene-level fusion.
- Score: 130.394884412296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However, objects in the BEV representation typically exhibit small sizes, and the associated point cloud context is inherently sparse, which leads to great challenges for reliable 3D perception. In this paper, we propose IS-Fusion, an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information, thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates, explores their relationships, and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark, IS-Fusion outperforms all the published multimodal works to date. Code is available at: https://github.com/yinjunbo/IS-Fusion.
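The abstract describes a two-stage design: scene-level fusion over the BEV grid (HSF) followed by instance-guided refinement (IGF) that writes mined instance context back into the scene feature. The PyTorch sketch below is only a minimal illustration of that collaboration under stated assumptions: the module names (ToyHSF, ToyIGF), tensor shapes, and the use of a single cross-attention in place of the Point-to-Grid and Grid-to-Region transformers are hypothetical simplifications, not the released IS-Fusion implementation.

```python
# Minimal, hypothetical sketch of instance-scene collaborative fusion.
# Not the IS-Fusion codebase; names, shapes, and layer choices are assumptions.
import torch
import torch.nn as nn


class ToyHSF(nn.Module):
    """Scene-level fusion: BEV grid queries attend to multimodal tokens."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_queries, mm_tokens):
        # bev_queries: (B, H*W, C) grid queries; mm_tokens: (B, N, C) projected
        # LiDAR/image tokens. One cross-attention stands in for the
        # Point-to-Grid / Grid-to-Region transformers at a single granularity.
        fused, _ = self.attn(bev_queries, mm_tokens, mm_tokens)
        return self.norm(bev_queries + fused)


class ToyIGF(nn.Module):
    """Instance-guided fusion: mine candidates, relate them, enhance the scene."""

    def __init__(self, dim=64, heads=4, num_instances=32):
        super().__init__()
        self.num_instances = num_instances
        self.heatmap = nn.Linear(dim, 1)  # candidate scoring per BEV cell
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_feat):
        # bev_feat: (B, H*W, C) scene feature from the scene-level stage.
        scores = self.heatmap(bev_feat).squeeze(-1)              # (B, H*W)
        topk = scores.topk(self.num_instances, dim=1).indices    # candidate cells
        idx = topk.unsqueeze(-1).expand(-1, -1, bev_feat.size(-1))
        inst = torch.gather(bev_feat, 1, idx)                    # (B, K, C) instances
        inst, _ = self.self_attn(inst, inst, inst)               # instance relations
        # Instances guide the scene feature, yielding an instance-aware BEV map.
        guided, _ = self.cross_attn(bev_feat, inst, inst)
        return self.norm(bev_feat + guided)


if __name__ == "__main__":
    B, HW, N, C = 2, 32 * 32, 2048, 64
    bev_queries = torch.randn(B, HW, C)   # BEV grid queries
    mm_tokens = torch.randn(B, N, C)      # fused LiDAR + image tokens
    scene = ToyHSF(C)(bev_queries, mm_tokens)
    instance_aware_bev = ToyIGF(C)(scene)
    print(instance_aware_bev.shape)       # torch.Size([2, 1024, 64])
```

The design choice mirrored here is that instance candidates are mined from, and then written back into, the scene feature, so the final BEV representation is instance-aware rather than purely grid-level.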
Related papers
- FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection [10.070120335536075]
Multimodal 3D object detection has garnered considerable interest in autonomous driving.
However, multimodal detectors suffer from dimension mismatches that derive from fusing 3D points with 2D pixels coarsely.
We propose a multimodal framework FGU3R to tackle the issue via unified 3D representation and fine-grained fusion.
arXiv Detail & Related papers (2025-01-08T09:26:36Z)
- HV-BEV: Decoupling Horizontal and Vertical Feature Sampling for Multi-View 3D Object Detection [34.72603963887331]
The application of vision-based multi-view environmental perception systems has been increasingly recognized in autonomous driving technology.
Current state-of-the-art solutions primarily encode image features from each camera view into the BEV space through explicit or implicit depth prediction.
We propose a novel approach that decouples feature sampling in the BEV grid queries paradigm into horizontal feature aggregation and vertical feature sampling.
arXiv Detail & Related papers (2024-12-25T11:49:14Z)
- Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification [60.9670254833103]
Person Re-identification (ReID) aims to retrieve the specific person across non-overlapping cameras.
We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID.
arXiv Detail & Related papers (2024-12-23T03:19:19Z)
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features via computation-friendly projection.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion [61.37481051263816]
Given a single image of a 3D object, this paper proposes a method (named ConsistNet) that is able to generate multiple images of the same object.
Our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-10-16T12:29:29Z)
- SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection [84.09798649295038]
Given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient.
We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations.
SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones.
arXiv Detail & Related papers (2023-04-27T17:17:39Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- ScaleVLAD: Improving Multimodal Sentiment Analysis via Multi-Scale Fusion of Locally Descriptors [15.042741192427334]
This paper proposes a fusion model named ScaleVLAD to gather multi-scale representations from text, video, and audio.
Experiments on three popular sentiment analysis benchmarks, IEMOCAP, MOSI, and MOSEI, demonstrate significant gains over baselines.
arXiv Detail & Related papers (2021-12-02T16:09:33Z)
- Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
arXiv Detail & Related papers (2021-07-19T16:42:49Z)
- CMF: Cascaded Multi-model Fusion for Referring Image Segmentation [24.942658173937563]
We address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression.
We propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel.
Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
arXiv Detail & Related papers (2021-06-16T08:18:39Z)