SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor
3D Object Detection
- URL: http://arxiv.org/abs/2304.14340v1
- Date: Thu, 27 Apr 2023 17:17:39 GMT
- Title: SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor
3D Object Detection
- Authors: Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim,
Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan
- Abstract summary: Given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient.
We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations.
SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones.
- Score: 84.09798649295038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By identifying four important components of existing LiDAR-camera 3D object
detection methods (LiDAR and camera candidates, transformation, and fusion
outputs), we observe that all existing methods either find dense candidates or
yield dense representations of scenes. However, given that objects occupy only
a small part of a scene, finding dense candidates and generating dense
representations is noisy and inefficient. We propose SparseFusion, a novel
multi-sensor 3D detection method that exclusively uses sparse candidates and
sparse representations. Specifically, SparseFusion utilizes the outputs of
parallel detectors in the LiDAR and camera modalities as sparse candidates for
fusion. We transform the camera candidates into the LiDAR coordinate space by
disentangling the object representations. Then, we can fuse the multi-modality
candidates in a unified 3D space by a lightweight self-attention module. To
mitigate negative transfer between modalities, we propose novel semantic and
geometric cross-modality transfer modules that are applied prior to the
modality-specific detectors. SparseFusion achieves state-of-the-art performance
on the nuScenes benchmark while also running at the fastest speed, even
outperforming methods with stronger backbones. We perform extensive experiments
to demonstrate the effectiveness and efficiency of our modules and overall
method pipeline. Our code will be made publicly available at
https://github.com/yichen928/SparseFusion.
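To make the pipeline described in the abstract more concrete, below is a minimal PyTorch sketch of its two core steps: mapping sparse camera candidates into the LiDAR coordinate space and fusing the two sets of instance candidates with a lightweight self-attention module. This is not the authors' implementation; the names (CandidateFusion, cam_to_lidar), the feature dimensions, and the simple rigid-transform mapping (the paper instead disentangles object representations) are illustrative assumptions.

```python
# Hedged sketch of sparse candidate fusion; names, shapes, and the transform
# are assumptions for illustration, not the released SparseFusion code.
import torch
import torch.nn as nn


def cam_to_lidar(centers_cam: torch.Tensor, T_cam2lidar: torch.Tensor) -> torch.Tensor:
    """Map candidate centers (B, N, 3) from the camera frame to the LiDAR frame
    using a 4x4 rigid transform in homogeneous coordinates (an assumption)."""
    ones = torch.ones_like(centers_cam[..., :1])
    homo = torch.cat([centers_cam, ones], dim=-1)      # (B, N, 4)
    return (homo @ T_cam2lidar.T)[..., :3]             # (B, N, 3)


class CandidateFusion(nn.Module):
    """Fuse sparse LiDAR and camera instance candidates in a shared 3D space
    with a single lightweight self-attention layer."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pos_embed = nn.Linear(3, dim)   # encode candidate 3D centers
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_feats, lidar_centers, cam_feats, cam_centers_cam, T_cam2lidar):
        # Bring the camera candidates into the LiDAR coordinate space first.
        cam_centers = cam_to_lidar(cam_centers_cam, T_cam2lidar)
        feats = torch.cat([lidar_feats, cam_feats], dim=1)        # (B, N_l + N_c, C)
        centers = torch.cat([lidar_centers, cam_centers], dim=1)  # (B, N_l + N_c, 3)
        tokens = feats + self.pos_embed(centers)
        fused, _ = self.attn(tokens, tokens, tokens)              # candidate-to-candidate attention
        return self.norm(tokens + fused)  # fused instance features for a detection head


# Example shapes only: 200 candidates per modality with 256-dim features.
if __name__ == "__main__":
    B, Nl, Nc, C = 2, 200, 200, 256
    fusion = CandidateFusion(dim=C)
    out = fusion(
        torch.randn(B, Nl, C), torch.randn(B, Nl, 3),
        torch.randn(B, Nc, C), torch.randn(B, Nc, 3),
        torch.eye(4),
    )
    print(out.shape)  # torch.Size([2, 400, 256])
```

In the actual method, the fused instance features would feed a detection head that predicts the final boxes and class scores.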
Related papers
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities either into Bird's Eye View (BEV) or Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception [47.000734648271006]
We introduce SparseFusion, a novel multi-modal fusion framework built upon sparse 3D features to facilitate efficient long-range perception.
The proposed module introduces sparsity from both semantic and geometric aspects, filling only the grids in which foreground objects potentially reside.
On the long-range Argoverse2 dataset, SparseFusion reduces the memory footprint and accelerates inference by about a factor of two compared to dense detectors.
arXiv Detail & Related papers (2024-03-15T05:59:10Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be effectively performed by combining their RoI features.
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
- Focal Sparse Convolutional Networks for 3D Object Detection [121.45950754511021]
We introduce two new modules to enhance the capability of Sparse CNNs.
They are focal sparse convolution (Focals Conv) and its multi-modal variant of focal sparse convolution with fusion.
For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
arXiv Detail & Related papers (2022-04-26T17:34:10Z)
- FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection [15.641616738865276]
We propose a general multimodal fusion framework FusionPainting to fuse the 2D RGB image and 3D point clouds at a semantic level for boosting the 3D object detection task.
Especially, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector.
The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark.
arXiv Detail & Related papers (2021-06-23T14:53:22Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation, which benefits many applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- CLOCs: Camera-LiDAR Object Candidates Fusion for 3D Object Detection [13.986963122264633]
We propose a novel Camera-LiDAR Object Candidates (CLOCs) fusion network.
CLOCs fusion provides a low-complexity multi-modal fusion framework.
We show that CLOCs ranks highest among all fusion-based methods on the official KITTI leaderboard.
arXiv Detail & Related papers (2020-09-02T02:07:00Z)
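The CLOCs entry above describes late fusion at the level of detection candidates rather than feature maps. The following is a hedged Python sketch of that general idea under simplifying assumptions: the 3D LiDAR detections are assumed to be already projected to 2D image boxes, association is done by IoU, and the rescoring rule is a simple noisy-OR rather than the learned fusion network used in CLOCs. The function names (iou_2d, fuse_candidates) are illustrative.

```python
# Illustrative candidate-level camera-LiDAR fusion; the matching and rescoring
# rules are assumptions, not the published CLOCs architecture.
import numpy as np


def iou_2d(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-6)


def fuse_candidates(lidar_boxes_img, lidar_scores, cam_boxes, cam_scores, iou_thr=0.5):
    """lidar_boxes_img: 3D detections already projected to image boxes, shape (N, 4);
    cam_boxes: 2D detections, shape (M, 4). Returns rescored LiDAR confidences."""
    fused = lidar_scores.copy()
    for i, lbox in enumerate(lidar_boxes_img):
        overlaps = np.array([iou_2d(lbox, cbox) for cbox in cam_boxes])
        j = int(overlaps.argmax()) if len(overlaps) else -1
        if j >= 0 and overlaps[j] > iou_thr:
            # Geometrically consistent pair: combine the two confidences
            # (a simple noisy-OR here; CLOCs learns this fusion instead).
            fused[i] = 1.0 - (1.0 - lidar_scores[i]) * (1.0 - cam_scores[j])
    return fused
```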