PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion
- URL: http://arxiv.org/abs/2212.06244v3
- Date: Tue, 16 Jan 2024 16:18:42 GMT
- Title: PathFusion: Path-consistent Lidar-Camera Deep Feature Fusion
- Authors: Lemeng Wu, Dilin Wang, Meng Li, Yunyang Xiong, Raghuraman
Krishnamoorthi, Qiang Liu, Vikas Chandra
- Abstract summary: We propose PathFusion as a solution to enable the alignment of semantically coherent LiDAR-camera deep feature fusion.
PathFusion introduces a path consistency loss at multiple stages within the network, encouraging the 2D backbone and its fusion path to transform 2D features in a way that aligns semantically with the transformation of the 3D backbone.
We observe an improvement of over 1.6% in mAP on the nuScenes test split consistently with and without testing-time data augmentations.
- Score: 30.803450612746403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fusing 3D LiDAR features with 2D camera features is a promising technique for
enhancing the accuracy of 3D detection, thanks to their complementary physical
properties. While most of the existing methods focus on directly fusing camera
features with raw LiDAR point clouds or shallow-level 3D features, it is
observed that directly combining 2D and 3D features in deeper layers actually
leads to a decrease in accuracy due to feature misalignment. The misalignment,
which stems from the aggregation of features learned from large receptive
fields, becomes increasingly more severe as we delve into deeper layers. In
this paper, we propose PathFusion as a solution to enable the alignment of
semantically coherent LiDAR-camera deep feature fusion. PathFusion introduces a
path consistency loss at multiple stages within the network, encouraging the 2D
backbone and its fusion path to transform 2D features in a way that aligns
semantically with the transformation of the 3D backbone. This ensures semantic
consistency between 2D and 3D features, even in deeper layers, and amplifies
the usage of the network's learning capacity. We apply PathFusion to improve a
prior-art fusion baseline, Focals Conv, and observe an improvement of over 1.6%
in mAP on the nuScenes test split consistently with and without testing-time
data augmentations, and moreover, PathFusion also improves KITTI
$\text{AP}_{\text{3D}}$ (R11) by about 0.6% on the moderate level.
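The abstract describes the path consistency loss only at a high level. As a minimal sketch under assumptions, it can be read as a stage-wise penalty on the distance between 2D features pushed through the fusion path and the corresponding 3D backbone features at the same stage; the function and argument names below (path_consistency_loss, fusion_paths, etc.) are hypothetical, and the choice of an MSE distance with a stop-gradient on the 3D branch is a guess, not a detail from the paper.

```python
import torch.nn.functional as F

def path_consistency_loss(feat2d_stages, feat3d_stages, fusion_paths):
    """Hypothetical multi-stage path consistency penalty (not the paper's exact form).

    feat2d_stages : list of 2D-backbone feature tensors, one per stage
    feat3d_stages : list of 3D-backbone feature tensors, already gathered into the
                    same shape as the fused 2D features at that stage
    fusion_paths  : list of modules mapping 2D features into the 3D feature space
    """
    loss = feat2d_stages[0].new_zeros(())
    for f2d, f3d, path in zip(feat2d_stages, feat3d_stages, fusion_paths):
        fused = path(f2d)  # push 2D features along the fusion path
        # keep the transformed 2D features semantically aligned with the 3D
        # backbone's features at the same stage (stop-gradient on the 3D side)
        loss = loss + F.mse_loss(fused, f3d.detach())
    return loss / len(feat2d_stages)
```

How the stages are chosen and how this term is weighted against the detection objective is not stated in the abstract.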
Related papers
- BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection [10.321117046185321]
This letter proposes a novel bidirectional complementary LiDAR-camera fusion framework, called BiCo-Fusion.
The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features.
We then introduce Unified Fusion to adaptively weight and select features from the enhanced LiDAR and camera features to build a unified 3D representation.
arXiv Detail & Related papers (2024-06-27T09:56:38Z) - Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z) - Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion [115.82306502822412]
StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing.
A corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing.
We study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures.
arXiv Detail & Related papers (2022-12-14T18:49:50Z) - Multi-Sem Fusion: Multimodal Semantic Fusion for 3D Object Detection [11.575945934519442]
LiDAR and camera fusion techniques are promising for achieving 3D object detection in autonomous driving.
Most multi-modal 3D object detection frameworks integrate semantic knowledge from 2D images into 3D LiDAR point clouds.
We propose a general multi-modal fusion framework Multi-Sem Fusion (MSF) to fuse the semantic information from both the 2D image and 3D points scene parsing results.
arXiv Detail & Related papers (2022-12-10T10:54:41Z) - FFPA-Net: Efficient Feature Fusion with Projection Awareness for 3D Object Detection [19.419030878019974]
Unstructured 3D point clouds are filled into the 2D plane, and 3D point cloud features are extracted faster using projection-aware convolution layers.
The corresponding indexes between the different sensor signals are established in advance during data preprocessing (a minimal sketch of such precomputed projection indexes appears after this list).
Two new plug-and-play fusion modules, LiCamFuse and BiLiCamFuse, are proposed.
arXiv Detail & Related papers (2022-09-15T16:13:19Z) - MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection [89.26380781863665]
Fusing LiDAR and camera information is essential for achieving accurate and reliable 3D object detection in autonomous driving systems.
Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images into 3D space for fusion.
We propose a novel framework that focuses on the multi-scale progressive interaction of the multi-granularity LiDAR and camera features.
arXiv Detail & Related papers (2022-09-07T12:29:29Z) - DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z) - Volumetric Propagation Network: Stereo-LiDAR Fusion for Long-Range Depth Estimation [81.08111209632501]
We propose a geometry-aware stereo-LiDAR fusion network for long-range depth estimation.
We exploit sparse and accurate point clouds as a cue for guiding correspondences of stereo images in a unified 3D volume space.
Our network achieves state-of-the-art performance on the KITTI and the Virtual KITTI datasets.
arXiv Detail & Related papers (2021-03-24T03:24:46Z) - Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection.
The whole architecture facilitates two-stage fusion.
Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)
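Several of the listed methods (e.g., FFPA-Net and Multi-Sem Fusion) rely on correspondences between LiDAR points and camera pixels that are computed before fusion. The sketch below shows one generic way to precompute such projection indexes; the calibration inputs (T_cam_from_lidar, K) and the function name are assumptions for illustration, not an API from any of these papers.

```python
import numpy as np

def lidar_to_image_indexes(points, T_cam_from_lidar, K, img_h, img_w):
    """Project LiDAR points into a camera image and keep per-point pixel indexes.

    points           : (N, 3) LiDAR points in the LiDAR frame
    T_cam_from_lidar : (4, 4) rigid transform from the LiDAR to the camera frame
    K                : (3, 3) camera intrinsic matrix
    Returns (valid_mask, pixel_uv), where pixel_uv holds integer (u, v)
    coordinates for every point that lands inside the image.
    """
    # homogeneous coordinates, then transform into the camera frame
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    in_front = pts_cam[:, 2] > 1e-3            # keep points in front of the camera
    uvw = (K @ pts_cam.T).T
    uv = uvw[:, :2] / uvw[:, 2:3]              # perspective division
    uv_int = np.round(uv).astype(np.int64)

    inside = (
        (uv_int[:, 0] >= 0) & (uv_int[:, 0] < img_w) &
        (uv_int[:, 1] >= 0) & (uv_int[:, 1] < img_h)
    )
    valid = in_front & inside
    return valid, uv_int[valid]
```

The returned mask and pixel coordinates are the kind of per-point indexes that can be cached during preprocessing, so that 2D features or semantic scores can later be gathered for each LiDAR point without re-projecting.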