LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global
Cross-Modal Fusion
- URL: http://arxiv.org/abs/2303.03595v1
- Date: Tue, 7 Mar 2023 02:00:34 GMT
- Title: LoGoNet: Towards Accurate 3D Object Detection with Local-to-Global
Cross-Modal Fusion
- Authors: Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yucheng Yang, Youquan Liu,
Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, Liang He
- Abstract summary: A novel Local-to-Global fusion network (LoGoNet) performs LiDAR-camera fusion at both local and global levels.
LoGoNet ranks 1st on the Waymo 3D object detection leaderboard.
For the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously.
- Score: 40.44084541717407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-camera fusion methods have shown impressive performance in 3D object
detection. Recent advanced multi-modal methods mainly perform global fusion,
where image features and point cloud features are fused across the whole scene.
Such practice lacks fine-grained region-level information, yielding suboptimal
fusion performance. In this paper, we present the novel Local-to-Global fusion
network (LoGoNet), which performs LiDAR-camera fusion at both local and global
levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous
literature, but we exclusively use point centroids to more precisely
represent the position of voxel features, thus achieving better cross-modal
alignment. As for the Local Fusion (LoF), we first divide each proposal into
uniform grids and then project these grid centers to the images. The image
features around the projected grid points are sampled to be fused with
position-decorated point cloud features, maximally utilizing the rich
contextual information around the proposals. The Feature Dynamic Aggregation
(FDA) module is further proposed to achieve information interaction between
these locally and globally fused features, thus producing more informative
multi-modal features. Extensive experiments on both Waymo Open Dataset (WOD)
and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D
detection methods. Notably, LoGoNet ranks 1st on Waymo 3D object detection
leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy
that, for the first time, the detection performance on three classes surpasses
80 APH (L2) simultaneously. Code will be available at
https://github.com/sankin97/LoGoNet.
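To make the Local Fusion step concrete, below is a minimal sketch of the general idea: divide a proposal into a uniform grid, project the grid centers onto the image plane with a calibration matrix, and bilinearly sample image features at the projected locations before combining them with point-cloud features. All function names, shapes, and the plain concatenation at the end are illustrative assumptions, not the authors' implementation.

```python
# Sketch of grid-center projection and image-feature sampling for local fusion.
# Names, shapes, and the dummy calibration matrix are illustrative assumptions.
import torch
import torch.nn.functional as F


def proposal_grid_centers(box, grid_size=6):
    """Uniform grid centers inside an axis-aligned proposal box.

    box: (6,) tensor [cx, cy, cz, dx, dy, dz] (rotation omitted for brevity).
    Returns (grid_size**3, 3) grid-center coordinates in the LiDAR frame.
    """
    center, dims = box[:3], box[3:6]
    steps = (torch.arange(grid_size, dtype=torch.float32) + 0.5) / grid_size - 0.5
    gx, gy, gz = torch.meshgrid(steps, steps, steps, indexing="ij")
    offsets = torch.stack([gx, gy, gz], dim=-1).reshape(-1, 3)
    return center + offsets * dims


def sample_image_features(img_feat, points_3d, proj_mat, img_size):
    """Project 3D points with a 3x4 LiDAR-to-image matrix and sample features.

    img_feat: (1, C, H, W) image feature map; img_size: (img_h, img_w) in pixels.
    Returns (N, C) bilinearly sampled features at the projected locations.
    """
    ones = torch.ones(points_3d.shape[0], 1)
    pts_hom = torch.cat([points_3d, ones], dim=1)            # (N, 4)
    uvw = pts_hom @ proj_mat.t()                             # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)            # pixel coordinates
    img_h, img_w = img_size
    # Normalize to [-1, 1] as required by grid_sample.
    norm_uv = torch.stack([uv[:, 0] / img_w, uv[:, 1] / img_h], dim=-1) * 2 - 1
    grid = norm_uv.view(1, 1, -1, 2)                         # (1, 1, N, 2)
    sampled = F.grid_sample(img_feat, grid, align_corners=False)  # (1, C, 1, N)
    return sampled.squeeze(0).squeeze(1).t()                 # (N, C)


# Toy usage: one proposal, a random image feature map, and a stand-in calibration.
box = torch.tensor([10.0, 2.0, -1.0, 4.0, 1.8, 1.6])
img_feat = torch.randn(1, 64, 96, 312)                       # 4x-downsampled feature map
proj = torch.randn(3, 4)                                     # stand-in LiDAR-to-image matrix
centers = proposal_grid_centers(box)                         # (216, 3)
img_cues = sample_image_features(img_feat, centers, proj, img_size=(384, 1248))
point_cues = torch.randn(centers.shape[0], 64)               # position-decorated point features
fused_local = torch.cat([point_cues, img_cues], dim=-1)      # simple concat fusion
print(fused_local.shape)                                     # torch.Size([216, 128])
```

In LoGoNet the sampled image cues would feed the LoF and FDA modules rather than a plain concatenation; the sketch only illustrates the projection-and-sampling mechanics.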
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- RGB-T Semantic Segmentation with Location, Activation, and Sharpening [27.381263494613556]
We propose a novel feature fusion-based network for RGB-T semantic segmentation, named LASNet, which follows the three steps of location, activation, and sharpening.
Experimental results on two public datasets demonstrate the superiority of our LASNet over relevant state-of-the-art methods.
arXiv Detail & Related papers (2022-10-26T07:42:34Z)
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- LATFormer: Locality-Aware Point-View Fusion Transformer for 3D Shape Recognition [38.540048855119004]
We propose a novel Locality-Aware Point-View Fusion Transformer (LATFormer) for 3D shape retrieval and classification.
The core component of LATFormer is a module named Locality-Aware Fusion (LAF) which integrates the local features of correlated regions across the two modalities.
In our LATFormer, we utilize the LAF module to fuse the multi-scale features of the two modalities both bidirectionally and hierarchically to obtain more informative features.
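As a rough illustration of the bidirectional fusion described above, the sketch below lets point tokens attend to view tokens and vice versa with standard cross-attention; the module layout, dimensions, and token counts are assumptions for illustration rather than the LATFormer design.

```python
# Hedged sketch of bidirectional point-view fusion: each modality is enriched by
# attending to the other. Not the LATFormer implementation; shapes are assumed.
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.p2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2p = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, point_tokens, view_tokens):
        # Point features query the view features, and vice versa.
        p_fused, _ = self.p2v(point_tokens, view_tokens, view_tokens)
        v_fused, _ = self.v2p(view_tokens, point_tokens, point_tokens)
        return point_tokens + p_fused, view_tokens + v_fused


fusion = BidirectionalFusion()
pts = torch.randn(2, 256, 128)    # (batch, point tokens, dim)
views = torch.randn(2, 196, 128)  # (batch, view tokens, dim)
p_out, v_out = fusion(pts, views)
print(p_out.shape, v_out.shape)   # torch.Size([2, 256, 128]) torch.Size([2, 196, 128])
```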
arXiv Detail & Related papers (2021-09-03T03:23:27Z)
- MBDF-Net: Multi-Branch Deep Fusion Network for 3D Object Detection [17.295359521427073]
We propose a Multi-Branch Deep Fusion Network (MBDF-Net) for 3D object detection.
In the first stage, our multi-branch feature extraction network utilizes Adaptive Attention Fusion modules to produce cross-modal fusion features from single-modal semantic features.
In the second stage, we use a region-of-interest (RoI) pooled fusion module to generate enhanced local features for refinement.
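The adaptive attention fusion step can be pictured as a learned gate that weights each modality per channel and location; the minimal sketch below shows that idea under assumed shapes and names, and is not the MBDF-Net implementation.

```python
# Sketch of gated (adaptive attention) fusion of two single-modal feature maps.
# Module and variable names are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveAttentionFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, lidar_feat, image_feat):
        # Gate in [0, 1] decides, per channel and location, which modality dominates.
        g = self.gate(torch.cat([lidar_feat, image_feat], dim=1))
        return g * lidar_feat + (1.0 - g) * image_feat


fuse = AdaptiveAttentionFusion(channels=64)
lidar_bev = torch.randn(1, 64, 200, 176)   # BEV features from the point-cloud branch
image_bev = torch.randn(1, 64, 200, 176)   # image features warped to the same grid
fused = fuse(lidar_bev, image_bev)
print(fused.shape)                          # torch.Size([1, 64, 200, 176])
```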
arXiv Detail & Related papers (2021-08-29T15:40:15Z)
- Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity levels.
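A similarity-aware late fusion of this kind can be sketched as weighting the back-projected 2D stream by how well it agrees with the 3D stream; the snippet below uses a simple cosine similarity as that weight, which is an illustrative stand-in for SAFNet's learned geometric and contextual similarity modules.

```python
# Hedged sketch: down-weight image-derived point features when they disagree
# with the original 3D point features. Cosine similarity is an assumption here.
import torch
import torch.nn.functional as F


def similarity_aware_fuse(feat_3d, feat_2d_backproj):
    """feat_3d, feat_2d_backproj: (N, C) per-point features from the two streams."""
    # Per-point similarity; low similarity suppresses the 2D stream, which helps
    # when the 2D-3D correspondence is unreliable (e.g., sparse or noisy depth).
    sim = F.cosine_similarity(feat_3d, feat_2d_backproj, dim=-1)   # (N,)
    w = sim.clamp(min=0.0).unsqueeze(-1)                           # (N, 1)
    return feat_3d + w * feat_2d_backproj


pts_feat = torch.randn(4096, 96)
img_feat = torch.randn(4096, 96)   # image features back-projected onto the points
fused = similarity_aware_fuse(pts_feat, img_feat)
print(fused.shape)                 # torch.Size([4096, 96])
```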
arXiv Detail & Related papers (2021-07-04T09:28:18Z)
- Volumetric Propagation Network: Stereo-LiDAR Fusion for Long-Range Depth Estimation [81.08111209632501]
We propose a geometry-aware stereo-LiDAR fusion network for long-range depth estimation.
We exploit sparse and accurate point clouds as a cue for guiding correspondences of stereo images in a unified 3D volume space.
Our network achieves state-of-the-art performance on the KITTI and Virtual KITTI datasets.
arXiv Detail & Related papers (2021-03-24T03:24:46Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)