FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving
- URL: http://arxiv.org/abs/2408.06832v1
- Date: Tue, 13 Aug 2024 11:46:32 GMT
- Title: FlatFusion: Delving into Details of Sparse Transformer-based Camera-LiDAR Fusion for Autonomous Driving
- Authors: Yutao Zhu, Xiaosong Jia, Xinyu Yang, Junchi Yan
- Abstract summary: The integration of data from diverse sensor modalities constitutes a prevalent methodology within the ambit of autonomous driving scenarios.
Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats.
In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion.
- Score: 63.96049803915402
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The integration of data from diverse sensor modalities (e.g., camera and LiDAR) constitutes a prevalent methodology within the ambit of autonomous driving scenarios. Recent advancements in efficient point cloud transformers have underscored the efficacy of integrating information in sparse formats. When it comes to fusion, since image patches are dense in pixel space with ambiguous depth, it necessitates additional design considerations for effective fusion. In this paper, we conduct a comprehensive exploration of design choices for Transformer-based sparse camera-LiDAR fusion. This investigation encompasses strategies for image-to-3D and LiDAR-to-2D mapping, attention neighbor grouping, single-modal tokenizers, and the micro-structure of the Transformer. By amalgamating the most effective principles uncovered through our investigation, we introduce FlatFusion, a carefully designed framework for sparse camera-LiDAR fusion. Notably, FlatFusion significantly outperforms state-of-the-art sparse Transformer-based methods, including UniTR, CMT, and SparseFusion, achieving 73.7 NDS on the nuScenes validation set at 10.1 FPS with PyTorch.
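Since the abstract's central design axis is how dense image features get pulled into a sparse point-token world, here is a minimal PyTorch sketch of the LiDAR-to-2D direction: projecting points into the image plane and sampling image features there to build per-point image tokens. All names, shapes, and the `lidar2img` calibration convention are illustrative assumptions, not FlatFusion's actual implementation.

```python
# Sketch of LiDAR-to-2D mapping for sparse token fusion (assumed API, not FlatFusion's).
import torch
import torch.nn.functional as F

def project_points_to_image(points_xyz, lidar2img, image_hw):
    """Project LiDAR points (N, 3) to pixel coordinates via a 4x4 lidar2img matrix."""
    n = points_xyz.shape[0]
    homo = torch.cat([points_xyz, points_xyz.new_ones(n, 1)], dim=1)  # (N, 4)
    cam = homo @ lidar2img.T                                          # (N, 4)
    depth = cam[:, 2].clamp(min=1e-5)
    uv = cam[:, :2] / depth.unsqueeze(1)                              # (N, 2) in pixels
    h, w = image_hw
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid

def gather_image_tokens(image_feat, uv, image_hw):
    """Bilinearly sample image features (1, C, H, W) at projected locations (N, 2)."""
    h, w = image_hw
    grid = torch.empty_like(uv)
    grid[:, 0] = uv[:, 0] / (w - 1) * 2 - 1   # normalize x to [-1, 1]
    grid[:, 1] = uv[:, 1] / (h - 1) * 2 - 1   # normalize y to [-1, 1]
    grid = grid.view(1, 1, -1, 2)                                  # (1, 1, N, 2)
    sampled = F.grid_sample(image_feat, grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                                      # (N, C) image tokens
```

Tokens gathered this way can be concatenated with point-cloud tokens and processed by windowed sparse attention, which is where the neighbor-grouping and micro-structure choices the paper studies come into play.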
Related papers
- Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV).
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
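As a rough illustration of the intermediate-level step, the toy PyTorch module below channel-concatenates a BEV map with perspective-view features assumed to be already lifted onto the same BEV grid; the module name and the lifting assumption are ours, and the actual ProFusion3D design (including its object-query-level stage) is considerably richer.

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    """Toy intermediate-level fusion of BEV and BEV-aligned PV features."""
    def __init__(self, c_bev, c_pv, c_out):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(c_bev + c_pv, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev_feat, pv_feat_in_bev):
        # Inputs: (B, C_bev, H, W) and (B, C_pv, H, W) on the same BEV grid.
        return self.fuse(torch.cat([bev_feat, pv_feat_in_bev], dim=1))
```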
arXiv Detail & Related papers (2024-10-09T22:57:47Z)
- Quantum Inverse Contextual Vision Transformers (Q-ICVT): A New Frontier in 3D Object Detection for AVs [4.378378863689719]
We develop an innovative two-stage fusion process called Quantum Inverse Contextual Vision Transformers (Q-ICVT).
This approach leverages the quantum concept of adiabatic computing to create a novel reversible vision transformer known as the Global Adiabatic Transformer (GAT).
Our experiments show that Q-ICVT achieves an mAPH of 82.54 for L2 difficulty on the Waymo Open Dataset, improving by 1.88% over current state-of-the-art fusion methods.
arXiv Detail & Related papers (2024-08-20T21:36:57Z)
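The summary only states that GAT is a reversible vision transformer; below is a generic additive-coupling reversible block in the RevNet style that such architectures commonly build on. It is an illustrative sketch, not the paper's actual design.

```python
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Generic additive-coupling reversible transformer block (RevNet style)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x1, x2):
        # Two token streams (B, L, C); each update is an invertible addition.
        h = self.norm1(x2)
        y1 = x1 + self.attn(h, h, h, need_weights=False)[0]
        y2 = x2 + self.mlp(self.norm2(y1))
        return y1, y2

    def inverse(self, y1, y2):
        # Inputs can be recomputed from outputs instead of stored, saving memory.
        x2 = y2 - self.mlp(self.norm2(y1))
        h = self.norm1(x2)
        x1 = y1 - self.attn(h, h, h, need_weights=False)[0]
        return x1, x2
```

Because `inverse` recovers activations exactly, intermediate states need not be cached for backpropagation, which is the usual motivation for reversibility.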
- SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection [84.09798649295038]
Given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient.
We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations.
SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones.
arXiv Detail & Related papers (2023-04-27T17:17:39Z)
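The candidates-only idea can be sketched compactly: instead of dense feature maps, only per-instance features from each modality are concatenated and mixed by self-attention. The module below is a toy version under that assumption; SparseFusion's actual pipeline, with per-modality detectors and cross-modal transfer, is more involved.

```python
import torch
import torch.nn as nn

class SparseCandidateFusion(nn.Module):
    """Toy fusion of sparse instance candidates from LiDAR and camera."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_inst, camera_inst):
        # lidar_inst: (B, N_l, C), camera_inst: (B, N_c, C) instance features.
        tokens = torch.cat([lidar_inst, camera_inst], dim=1)  # (B, N_l + N_c, C)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + fused)  # only sparse candidates, no dense maps
```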
- CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer [14.849645397321185]
Camera and radar sensors have significant advantages in cost, reliability, and maintenance compared to LiDAR.
Existing fusion methods often fuse the outputs of single modalities at the result level, a strategy known as late fusion.
Here we propose a novel proposal-level early fusion approach that effectively exploits both spatial and contextual properties of camera and radar for 3D object detection.
Our camera-radar fusion approach achieves the state-of-the-art 41.1% mAP and 52.3% NDS on the nuScenes test set, which is 8.7 and 10.8 points higher than the camera-only baseline, as well as yielding competitive performance on the
arXiv Detail & Related papers (2022-09-14T10:25:30Z)
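CRAFT's spatio-contextual association operates around proposals in polar coordinates; as a loose sketch of the proposal-level idea, the toy function below softly pools radar features around each proposal center with a distance-based softmax. The temperature `tau` and all shapes are assumptions, not the paper's formulation.

```python
import torch

def gather_radar_for_proposals(prop_centers, radar_xy, radar_feat, tau=2.0):
    """Toy proposal-level association of radar points to proposals.

    prop_centers: (P, 2) BEV proposal centers; radar_xy: (R, 2) radar point
    positions; radar_feat: (R, C) radar features.
    """
    d = torch.cdist(prop_centers, radar_xy)  # (P, R) pairwise distances
    w = torch.softmax(-d / tau, dim=1)       # nearer radar points weigh more
    return w @ radar_feat                    # (P, C) pooled radar feature per proposal
```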
- TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [49.689566246504356]
We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions.
TransFusion achieves state-of-the-art performance on large-scale datasets.
We extend the proposed method to the 3D tracking task and achieve 1st place on the nuScenes tracking leaderboard.
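TransFusion's soft association lets object queries attend to image features rather than relying on a hard point-to-pixel projection; the block below is a generic query-to-image cross-attention layer in that spirit, not the paper's exact two-stage decoder.

```python
import torch.nn as nn

class QueryImageCrossAttention(nn.Module):
    """Generic cross-attention from object queries to flattened image tokens."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, image_tokens):
        # queries: (B, Q, C); image_tokens: (B, H*W, C) flattened image features.
        # Each query softly attends to all pixels instead of a hard projection,
        # so degraded calibration or image quality cannot zero out the match.
        out, _ = self.attn(queries, image_tokens, image_tokens)
        return self.norm(queries + out)
```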
arXiv Detail & Related papers (2022-03-22T07:15:13Z)
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
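As a hedged sketch of transformer-based fusion at a single scale, the module below lets tokens from two source images attend to one another and then averages them back into one feature map; IFT's actual multi-scale strategy differs, and the module name is ours.

```python
import torch
import torch.nn as nn

class TwoImageScaleFusion(nn.Module):
    """Toy single-scale step of transformer-based two-image fusion."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C, H, W) features of the two images at one scale.
        b, c, h, w = feat_a.shape
        tokens = torch.cat([feat_a, feat_b], dim=2)        # stack along height
        tokens = tokens.flatten(2).transpose(1, 2)          # (B, 2*H*W, C)
        fused = self.norm(tokens + self.attn(tokens, tokens, tokens)[0])
        fused = fused.transpose(1, 2).view(b, c, 2 * h, w)
        # Average the two images' token sets back into one fused map.
        return 0.5 * (fused[:, :, :h] + fused[:, :, h:])
```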
arXiv Detail & Related papers (2021-07-19T16:42:49Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
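The two-stream plus residual-fusion pattern can be sketched compactly: the camera stream contributes a learned residual on top of the LiDAR stream. The module below is a minimal guess at that pattern, assuming both streams already live in a common (e.g., perspective) view; names and shapes are illustrative, not PMF's actual modules.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Toy residual-based fusion: camera features refine LiDAR features."""
    def __init__(self, c_lidar, c_cam):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(c_lidar + c_cam, c_lidar, kernel_size=1),
            nn.BatchNorm2d(c_lidar),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_feat, cam_feat):
        # Both feature maps share the same spatial resolution and view.
        residual = self.reduce(torch.cat([lidar_feat, cam_feat], dim=1))
        return lidar_feat + residual  # the LiDAR path stays intact
```

Keeping the LiDAR path as an identity branch means the fusion module can only add information, which is the usual rationale for residual-style fusion.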