VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework
for Multi-Modal 3D Object Detection
- URL: http://arxiv.org/abs/2401.02702v1
- Date: Fri, 5 Jan 2024 08:10:49 GMT
- Title: VoxelNextFusion: A Simple, Unified and Effective Voxel Fusion Framework
for Multi-Modal 3D Object Detection
- Authors: Ziying Song, Guoxin Zhang, Jun Xie, Lin Liu, Caiyan Jia, Shaoqing Xu,
Zhepeng Wang
- Abstract summary: Existing voxel-based methods face challenges when fusing sparse voxel features with dense image features in a one-to-one manner.
We present VoxelNextFusion, a multi-modal 3D object detection framework specifically designed for voxel-based methods.
- Score: 33.46363259200292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LiDAR-camera fusion can enhance the performance of 3D object
detection by utilizing complementary information between depth-aware LiDAR
points and semantically rich images. Existing voxel-based methods face
significant challenges when fusing sparse voxel features with dense image
features in a one-to-one manner; this forfeits the advantages of images,
such as their semantic and continuity information, and leads to sub-optimal
detection performance, especially at long distances. In this paper, we
present VoxelNextFusion, a multi-modal 3D object detection framework
specifically designed for voxel-based methods, which effectively bridges the
gap between sparse point clouds and dense images. In particular, we propose
a voxel-based image pipeline that projects point clouds onto images to
obtain both pixel- and patch-level features. These features are then fused
using self-attention to obtain a combined representation. Moreover, to
address the issue of background features present in patches, we propose a
feature importance module that effectively distinguishes between foreground
and background features, thus minimizing the impact of the background.
Extensive experiments were conducted on the widely used KITTI and nuScenes
3D object detection benchmarks. Notably, VoxelNextFusion achieves an
improvement of around +3.20% AP@0.7 for car detection at the hard difficulty
level over the Voxel R-CNN baseline on the KITTI test set.
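Below is a minimal sketch of the pipeline the abstract describes: projected
points (voxel centres stand in for them here) index into the image, a pixel
feature plus a surrounding patch of features is gathered for each, the two
are fused with self-attention, and a learned importance gate suppresses
background before the result is concatenated back onto the voxel feature.
The module layout, tensor shapes, and the sigmoid form of the gate are
illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch (not the authors' code) of the fusion the abstract
# describes: gather one pixel feature plus a surrounding patch per projected
# voxel, fuse them with self-attention, and gate background context with a
# learned importance score.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelPatchFusion(nn.Module):
    def __init__(self, c_img: int, c_voxel: int, patch: int = 3, heads: int = 4):
        super().__init__()
        self.patch = patch
        self.attn = nn.MultiheadAttention(c_img, heads, batch_first=True)
        # Assumed form of the feature importance module: a sigmoid gate.
        self.importance = nn.Sequential(nn.Linear(c_img, 1), nn.Sigmoid())
        self.out = nn.Linear(c_voxel + c_img, c_voxel)

    def forward(self, voxel_feat, uv, img_feat):
        # voxel_feat: (N, c_voxel) sparse voxel features
        # uv:         (N, 2) long tensor, pixel coords of projected voxels
        # img_feat:   (C, H, W) image feature map
        C, H, W = img_feat.shape
        k = self.patch // 2
        pad = F.pad(img_feat, (k, k, k, k))  # keep border patches in bounds
        patches = pad.unfold(1, self.patch, 1).unfold(2, self.patch, 1)
        patches = patches[:, uv[:, 1], uv[:, 0]]              # (C, N, p, p)
        patches = patches.permute(1, 2, 3, 0).reshape(len(uv), -1, C)
        pixel = img_feat[:, uv[:, 1], uv[:, 0]].t().unsqueeze(1)  # (N, 1, C)
        fused, _ = self.attn(pixel, patches, patches)  # pixel queries patch
        fused = fused.squeeze(1)                       # (N, C)
        gate = self.importance(fused)                  # (N, 1), ~0 = background
        return self.out(torch.cat([voxel_feat, gate * fused], dim=-1))
```

Using the pixel feature as the attention query and its patch as keys and
values lets each sparse voxel borrow local image context without densifying
the whole image feature map.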
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively (see the sketch below).
arXiv Detail & Related papers (2024-08-26T19:43:01Z)
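The PVAFN summary only names a multi-pooling strategy; one plausible reading,
shown here under assumed names and shapes, is pooling the same RoI feature
volume with max and average pooling at several grid resolutions and
concatenating the results.

```python
# One plausible reading of "multi-pooling": pool the same RoI feature volume
# with max and average pooling at several grid scales, then concatenate.
# All names and shapes here are assumptions, not PVAFN's actual design.
import torch
import torch.nn.functional as F

def multi_pool(roi_feat: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    # roi_feat: (B, C, D, H, W) voxelized features inside one RoI
    pooled = []
    for s in scales:
        pooled.append(F.adaptive_max_pool3d(roi_feat, s).flatten(1))  # (B, C*s^3)
        pooled.append(F.adaptive_avg_pool3d(roi_feat, s).flatten(1))
    return torch.cat(pooled, dim=1)  # one multi-scale descriptor per RoI

feats = torch.randn(2, 64, 8, 8, 8)
print(multi_pool(feats).shape)  # torch.Size([2, 9344]) = 64*(1+8+64)*2
```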
- FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion [8.168523242105763]
We introduce a novel vision transformer-based 3D object detection model, namely FusionViT.
Our FusionViT model achieves state-of-the-art performance, outperforming existing baseline methods.
arXiv Detail & Related papers (2023-11-07T00:12:01Z)
- MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates (see the sketch below).
arXiv Detail & Related papers (2023-07-18T11:26:02Z)
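The FCR module is described only as exploiting image semantics to rectify
candidate confidence; the sketch below shows one simple decision-level form
of that idea, blending the detector score with a foreground probability
sampled at the box's projected centre. The blending rule and all names are
assumptions.

```python
# Hedged sketch of decision-level confidence rectification: blend a
# detector's score with a semantic foreground probability sampled from an
# image segmentation map. The blending rule and names are assumptions.
import numpy as np

def rectify_confidence(det_score: float, seg_prob_map: np.ndarray,
                       center_uv: tuple, alpha: float = 0.5) -> float:
    u, v = center_uv
    h, w = seg_prob_map.shape
    sem = float(seg_prob_map[min(v, h - 1), min(u, w - 1)])  # clamp to bounds
    # Keep half the original score, scale the other half by image semantics.
    return alpha * det_score + (1.0 - alpha) * det_score * sem

seg = np.full((4, 6), 0.9)                   # toy foreground-probability map
print(rectify_confidence(0.8, seg, (2, 1)))  # 0.8*0.5 + 0.8*0.9*0.5 = 0.76
```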
- Adjacent-Level Feature Cross-Fusion With 3-D CNN for Remote Sensing Image Change Detection [20.776673215108815]
We propose a novel adjacent-level feature fusion network with 3D convolution, named AFCF3D-Net (see the sketch below).
The proposed AFCF3D-Net has been validated on three challenging remote sensing change detection (CD) datasets.
arXiv Detail & Related papers (2023-02-10T08:21:01Z)
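The AFCF3D-Net summary names adjacent-level feature fusion with 3D
convolution; a minimal sketch of that pattern, assuming the usual
upsample-concatenate-convolve layout rather than the paper's exact design:

```python
# Minimal sketch of adjacent-level feature fusion with a 3D convolution:
# upsample the deeper feature map, concatenate it with the neighbouring
# level, and fuse with Conv3d. Names and channel choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentFuse3D(nn.Module):
    def __init__(self, c_shallow: int, c_deep: int, c_out: int):
        super().__init__()
        self.fuse = nn.Conv3d(c_shallow + c_deep, c_out, kernel_size=3, padding=1)

    def forward(self, shallow, deep):
        # shallow: (B, Cs, D, H, W); deep: (B, Cd, D/2, H/2, W/2)
        deep_up = F.interpolate(deep, size=shallow.shape[2:], mode='trilinear',
                                align_corners=False)
        return torch.relu(self.fuse(torch.cat([shallow, deep_up], dim=1)))

out = AdjacentFuse3D(32, 64, 32)(torch.randn(1, 32, 8, 40, 40),
                                 torch.randn(1, 64, 4, 20, 20))
print(out.shape)  # torch.Size([1, 32, 8, 40, 40])
```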
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse image features with point cloud features projected onto the 2D image plane or combine the sparse point cloud with dense image pixels (the shared projection step is sketched below).
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
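Both fusion routes mentioned above rest on the same primitive: projecting
LiDAR points into the image plane with a camera matrix and reading features
at the resulting pixels. A minimal sketch, with a placeholder calibration
matrix rather than real sensor data:

```python
# Standard LiDAR-to-image projection: multiply homogeneous points by a 3x4
# camera matrix and perspective-divide. The matrix here is a placeholder.
import numpy as np

def project_points(points: np.ndarray, P: np.ndarray):
    # points: (N, 3) points already in the camera frame; P: (3, 4)
    homo = np.hstack([points, np.ones((len(points), 1))])   # (N, 4)
    uvw = homo @ P.T                                        # (N, 3)
    z = uvw[:, 2]
    uv = uvw[:, :2] / z[:, None]                            # perspective divide
    return uv, z > 0                           # pixel coords, in-front-of-camera mask

P = np.array([[700.0, 0, 620, 0], [0, 700.0, 190, 0], [0, 0, 1, 0]])
uv, valid = project_points(np.array([[2.0, 0.5, 10.0]]), P)
print(uv[valid])  # [[760. 225.]]: a point 10 m ahead lands centre-right
```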
- FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D Object Detection [19.419030878019974]
Unstructured 3D point clouds are filled into the 2D plane, and point cloud features are extracted faster using projection-aware convolution layers.
The corresponding indexes between the different sensor signals are established in advance during data preprocessing (see the sketch below).
Two new plug-and-play fusion modules, LiCamFuse and BiLiCamFuse, are proposed.
arXiv Detail & Related papers (2022-09-15T16:13:19Z)
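A hedged sketch of the preprocessing idea in FFPA-Net's summary: compute
each point's pixel index once offline, so that gathering image features at
training time is a single indexed read. Function and variable names are
assumptions, not FFPA-Net's actual API.

```python
# Precompute the point-to-pixel correspondence once during preprocessing;
# later lookups become one fancy-indexing read. Names are assumptions.
import numpy as np

def build_pixel_lookup(uv: np.ndarray, h: int, w: int) -> np.ndarray:
    # uv: (N, 2) projected pixel coordinates of the LiDAR points
    ij = np.round(uv).astype(np.int64)
    valid = (ij[:, 0] >= 0) & (ij[:, 0] < w) & (ij[:, 1] >= 0) & (ij[:, 1] < h)
    flat = ij[:, 1] * w + ij[:, 0]           # flattened pixel index per point
    return np.where(valid, flat, -1)         # -1 marks points outside the image

lookup = build_pixel_lookup(np.array([[3.2, 1.9], [99.0, 2.0]]), h=4, w=6)
print(lookup)  # [15 -1]: the second point projects outside the 6x4 map
# At train time: feats = img_feat.reshape(C, h * w)[:, lookup[lookup >= 0]]
```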
- AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
- Unifying Voxel-based Representation with Transformer for 3D Object Detection [143.91910747605107]
We present a unified framework for multi-modality 3D object detection, named UVTR.
The proposed method aims to unify multi-modality representations in the voxel space for accurate and robust single- or cross-modality 3D detection.
UVTR achieves leading performance in the nuScenes test set with 69.7%, 55.1%, and 71.1% NDS for LiDAR, camera, and multi-modality inputs, respectively.
arXiv Detail & Related papers (2022-06-01T17:02:40Z)
- Voxel Field Fusion for 3D Object Detection [140.6941303279114]
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field (see the sketch below).
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-05-31T16:31:36Z)
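The voxel field fusion summary describes fusing an image feature as a ray in
the voxel field; the sketch below illustrates the general ray-splatting idea
(sample depths along a pixel's camera ray, convert the samples to voxel
indices, scatter-add the feature), with grid layout, ranges, and names as
assumptions rather than the paper's actual construction.

```python
# Hedged sketch of ray-based fusion: broadcast one pixel's image feature to
# the voxels its camera ray passes through. Names and grid layout assumed.
import torch

def splat_ray(voxel_grid, feat, origin, direction, depths, voxel_size, grid_min):
    # voxel_grid: (C, X, Y, Z) dense feature volume; feat: (C,) pixel feature
    pts = origin[None] + depths[:, None] * direction[None]    # (D, 3) samples
    idx = ((pts - grid_min) / voxel_size).floor().long()      # voxel coords
    bounds = torch.tensor(voxel_grid.shape[1:])
    ok = ((idx >= 0) & (idx < bounds)).all(dim=1)             # keep in-grid
    for i, j, k in idx[ok]:
        voxel_grid[:, i, j, k] += feat                        # fuse along ray
    return voxel_grid

grid = torch.zeros(8, 16, 16, 4)
dirn = torch.tensor([1.0, 0.0, 0.0])          # ray marching along +x
splat_ray(grid, torch.ones(8), torch.zeros(3), dirn,
          torch.linspace(0.5, 7.5, 8), voxel_size=0.5, grid_min=torch.zeros(3))
```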
- VPFNet: Voxel-Pixel Fusion Network for Multi-class 3D Object Detection [5.12292602924464]
This paper proposes a fusion-based 3D object detection network, named Voxel-Pixel Fusion Network (VPFNet).
The proposed method is evaluated on the KITTI benchmark for the multi-class 3D object detection task under multiple difficulty levels.
It is shown to outperform all state-of-the-art methods in mean average precision (mAP).
arXiv Detail & Related papers (2021-11-01T14:17:09Z)
- Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.