AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection
- URL: http://arxiv.org/abs/2201.06493v1
- Date: Mon, 17 Jan 2022 16:08:57 GMT
- Title: AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection
- Authors: Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinghong Jiang, Feng Zhao, Bolei Zhou, Hang Zhao
- Abstract summary: We propose AutoAlign, an automatic feature fusion strategy for 3D object detection.
We show that our approach can lead to 2.3 mAP and 7.0 mAP improvements on the KITTI and nuScenes datasets, respectively.
- Score: 46.03951171790736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object detection through either RGB images or the LiDAR point clouds has been
extensively explored in autonomous driving. However, it remains challenging to
make these two data sources complementary and beneficial to each other. In this
paper, we propose AutoAlign, an automatic feature fusion strategy for
3D object detection. Instead of establishing deterministic correspondence with
the camera projection matrix, we model the mapping relationship between the image
and point clouds with a learnable alignment map. This map enables our model to
automate the alignment of non-homogeneous features in a dynamic and data-driven
manner. Specifically, a cross-attention feature alignment module is devised to
adaptively aggregate pixel-level image features for each voxel. To
enhance the semantic consistency during feature alignment, we also design a
self-supervised cross-modal feature interaction module, through which the model
can learn feature aggregation with instance-level feature guidance.
Extensive experimental results show that our approach can lead to 2.3 mAP and
7.0 mAP improvements on the KITTI and nuScenes datasets, respectively. Notably,
our best model reaches 70.9 NDS on the nuScenes testing leaderboard, achieving
competitive performance among various state-of-the-art methods.
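The abstract describes the alignment module only at a high level. As a rough illustration, here is a minimal PyTorch sketch of what cross-attention between voxel features (queries) and pixel-level image features (keys/values) could look like; all dimensions, names, and the single-head design are assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of cross-attention feature
# alignment: each non-empty voxel queries pixel-level image features,
# so the 3D-2D mapping is learned rather than fixed by the camera
# projection matrix. Shapes and dims are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionAlign(nn.Module):
    def __init__(self, voxel_dim=128, img_dim=256, embed_dim=128):
        super().__init__()
        self.q = nn.Linear(voxel_dim, embed_dim)   # queries from voxels
        self.k = nn.Linear(img_dim, embed_dim)     # keys from pixels
        self.v = nn.Linear(img_dim, embed_dim)     # values from pixels
        self.scale = embed_dim ** -0.5

    def forward(self, voxel_feats, img_feats):
        # voxel_feats: (N_voxels, voxel_dim)
        # img_feats:   (H*W, img_dim) flattened image feature map
        q = self.q(voxel_feats)                    # (N, E)
        k = self.k(img_feats)                      # (P, E)
        v = self.v(img_feats)                      # (P, E)
        attn = (q @ k.t()) * self.scale            # (N, P) learnable alignment map
        attn = attn.softmax(dim=-1)
        fused = attn @ v                           # (N, E) aggregated pixel features
        return torch.cat([voxel_feats, fused], dim=-1)

# usage with placeholder inputs
align = CrossAttentionAlign()
voxels = torch.randn(500, 128)       # 500 non-empty voxels
pixels = torch.randn(64 * 64, 256)   # 64x64 image feature map
out = align(voxels, pixels)          # (500, 256)
```

Note that the dense N_voxels × H·W attention map is expensive; this is presumably the cost that the deformable sampling in AutoAlignV2 (listed below) was designed to avoid.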
Related papers
- MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection [28.319440934322728]
MV2DFusion is a multi-modal detection framework that integrates the strengths of both worlds through an advanced query-based fusion mechanism.
The framework's flexibility allows it to integrate with any image-based and point-cloud-based detectors, showcasing its adaptability and potential for future advancements.
arXiv Detail & Related papers (2024-08-12T06:46:05Z)
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose; a toy version of this recipe is sketched below.
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
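The summary mentions a learnable transformation alignment and dense 2D-3D correspondences used to estimate a rigid pose. Below is a toy sketch of that general recipe, with assumed feature dimensions and a plain Kabsch/SVD solver standing in for whatever the paper actually uses.

```python
# Hedged sketch: map pixel and point features into a shared space
# with small learned heads, read dense correspondences off a
# similarity matrix, then fit a rigid pose with Kabsch/SVD.
# All names, dims, and the solver choice are assumptions.
import torch
import torch.nn as nn

feat2d = nn.Linear(256, 64)        # image-feature head (dims assumed)
feat3d = nn.Linear(128, 64)        # point-feature head (dims assumed)

img_f = torch.randn(1000, 256)     # per-pixel features (placeholder)
pts_f = torch.randn(800, 128)      # per-point features (placeholder)
pix_xyz = torch.randn(1000, 3)     # 3D positions back-projected from pixels
pts_xyz = torch.randn(800, 3)      # LiDAR points

sim = feat2d(img_f) @ feat3d(pts_f).t()   # (1000, 800) similarity
match = sim.argmax(dim=1)                 # dense 2D->3D correspondences
src, dst = pts_xyz[match], pix_xyz        # matched point/pixel positions

# Kabsch/SVD fit: rigid (R, t) minimizing ||R @ src + t - dst||
sc, dc = src.mean(0), dst.mean(0)
H = (src - sc).t() @ (dst - dc)
U, _, Vt = torch.linalg.svd(H)
R = Vt.t() @ U.t()
if torch.linalg.det(R) < 0:               # fix an improper reflection
    Vt = Vt.clone()
    Vt[-1] *= -1
    R = Vt.t() @ U.t()
t = dc - R @ sc
```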
- Multi-Projection Fusion and Refinement Network for Salient Object Detection in 360° Omnidirectional Image [141.10227079090419]
We propose a Multi-Projection Fusion and Refinement Network (MPFR-Net) to detect the salient objects in 360° omnidirectional images.
MPFR-Net uses the equirectangular projection image and four corresponding cube-unfolding images as inputs.
Experimental results on two omnidirectional datasets demonstrate that the proposed approach outperforms the state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-12-23T14:50:40Z)
- PAI3D: Painting Adaptive Instance-Prior for 3D Object Detection [22.41785292720421]
Painting Adaptive Instance-prior for 3D object detection (PAI3D) is a sequential instance-level fusion framework.
It first extracts instance-level semantic information from images.
Extracted information, including each object's categorical label, point-to-object membership, and object position, is then used to augment each LiDAR point in the subsequent 3D detection network (sketched below).
arXiv Detail & Related papers (2022-11-15T11:15:25Z)
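As a hedged illustration of instance-level painting in the spirit of PAI3D: project each LiDAR point into the instance-segmentation output and append the class label, instance membership, and object position. Every name, shape, and the precomputed-centroid input are assumptions, not the paper's code.

```python
# Rough sketch of instance-level point painting: look up each
# projected LiDAR point in an instance-segmentation map and append
# class label, instance id (point-to-object membership), and the
# object's centroid. Inputs and shapes are assumed for illustration.
import torch

def paint_points(points, inst_map, cls_map, centroids, P):
    # points:    (N, 3) LiDAR xyz
    # inst_map:  (H, W) per-pixel instance id (long, -1 = background)
    # cls_map:   (H, W) per-pixel class label (long)
    # centroids: (K, 3) per-instance 3D centers (assumed precomputed)
    # P:         (3, 4) camera projection matrix
    N = points.shape[0]
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)   # (N, 4)
    uvw = homo @ P.t()                                    # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2]).long().clamp(0, inst_map.shape[1] - 1)
    v = (uvw[:, 1] / uvw[:, 2]).long().clamp(0, inst_map.shape[0] - 1)
    inst = inst_map[v, u]                                 # (N,) membership
    cls = cls_map[v, u].float().unsqueeze(1)              # (N, 1) label
    center = torch.where(                                 # (N, 3) object position
        inst.unsqueeze(1) >= 0,
        centroids[inst.clamp(min=0)],
        torch.zeros(N, 3),
    )
    return torch.cat([points, cls, inst.float().unsqueeze(1), center], dim=1)
```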
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information, employing multiple frames to detect objects and track them in a single network; the cross-frame association step is sketched below.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
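A toy sketch of the cross-frame affinity idea: score all pairs of detection embeddings between consecutive frames and link each current detection to its best previous match. This is a generic association step, not the paper's attention-guided design; names and the threshold are made up.

```python
# Assumption-laden sketch of affinity-based track linking between
# two frames: cosine affinities over detection embeddings, then a
# greedy best-match assignment with a new-track threshold.
import torch
import torch.nn.functional as F

def link_tracks(prev_emb, curr_emb, thresh=0.5):
    # prev_emb: (M, D) embeddings of detections in frame t-1
    # curr_emb: (N, D) embeddings of detections in frame t
    prev_n = F.normalize(prev_emb, dim=1)
    curr_n = F.normalize(curr_emb, dim=1)
    affinity = curr_n @ prev_n.t()            # (N, M) cosine affinities
    score, match = affinity.max(dim=1)        # best previous detection
    match[score < thresh] = -1                # -1 = start a new track
    return affinity, match

affinity, match = link_tracks(torch.randn(6, 128), torch.randn(7, 128))
```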
- Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection [16.198358858773258]
Multi-modal 3D object detection has been an active research topic in autonomous driving.
It is non-trivial to explore the cross-modal feature fusion between sparse 3D points and dense 2D pixels.
Recent approaches either fuse the image features with the point cloud features that are projected onto the 2D image plane or combine the sparse point cloud with dense image pixels (the projection pattern is sketched below).
arXiv Detail & Related papers (2022-10-18T06:15:56Z)
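For the projection-based fusion pattern the summary describes, here is a minimal sketch: project LiDAR points onto the image plane with the calibration matrix and bilinearly sample per-point image features there. Shapes and the calibration input are placeholders, and depth/visibility checks are omitted for brevity.

```python
# Sketch of classical projection-based fusion: project points with
# the calibration matrix, then bilinearly sample image features at
# the projected pixel locations. (No depth/visibility handling.)
import torch
import torch.nn.functional as F

def sample_img_feats(points, img_feats, P):
    # points: (N, 3), img_feats: (1, C, H, W), P: (3, 4)
    N = points.shape[0]
    _, C, H, W = img_feats.shape
    homo = torch.cat([points, torch.ones(N, 1)], dim=1)
    uvw = homo @ P.t()
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    # normalize pixel coords to [-1, 1] for grid_sample
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], dim=-1)
    grid = grid.view(1, N, 1, 2)
    sampled = F.grid_sample(img_feats, grid, align_corners=True)
    return sampled.view(C, N).t()   # (N, C) per-point image features
```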
- AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection [17.526914782562528]
We propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign.
Our best model reaches 72.4 NDS on the nuScenes test leaderboard, achieving new state-of-the-art results; a guess at the deformable aggregation pattern is sketched below.
arXiv Detail & Related papers (2022-07-21T06:17:23Z)
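The blurb does not spell out the deformable aggregation itself, so the following is only a guess at the general pattern: each query samples a few learned offset locations around a 2D reference point instead of attending over all pixels (presumably the source of the speed-up over AutoAlign's dense map). Module names, dims, and the offset scale are assumptions.

```python
# Hedged sketch of deformable feature aggregation: per-query learned
# 2D offsets around a reference point, bilinear sampling at those
# few locations, then a learned weighted sum. Not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAggregate(nn.Module):
    def __init__(self, dim=128, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # per-query 2D offsets
        self.weights = nn.Linear(dim, n_points)       # per-sample weights

    def forward(self, query, ref_xy, img_feats):
        # query: (N, dim), ref_xy: (N, 2) in [-1, 1], img_feats: (1, C, H, W)
        N = query.shape[0]
        off = self.offsets(query).view(N, self.n_points, 2) * 0.1  # small offsets
        loc = (ref_xy.unsqueeze(1) + off).view(1, N, self.n_points, 2)
        samp = F.grid_sample(img_feats, loc, align_corners=True)   # (1, C, N, P)
        w = self.weights(query).softmax(dim=-1)                    # (N, P)
        return (samp[0].permute(1, 2, 0) * w.unsqueeze(-1)).sum(dim=1)  # (N, C)
```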
- Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR.
Fusing these two modalities can significantly boost the performance of 3D perception models.
We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
- SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV-, voxel-, and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations (a generic self-attention sketch follows below).
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
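A minimal sketch of the pairwise (full) self-attention variant applied to 3D features, using a stock PyTorch attention layer; the feature count and dimensions are illustrative assumptions.

```python
# Pairwise self-attention over 3D features: every voxel/point
# feature attends to every other one to add global context before
# the detection head. Sizes here are placeholders.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
feats = torch.randn(1, 2048, 128)      # e.g. 2048 voxel/point features
ctx, _ = attn(feats, feats, feats)     # pairwise self-attention
feats = feats + ctx                    # residual context enrichment
```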
- PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.