AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D
Object Detection
- URL: http://arxiv.org/abs/2207.10316v1
- Date: Thu, 21 Jul 2022 06:17:23 GMT
- Title: AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D
Object Detection
- Authors: Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang,
Feng Zhao
- Abstract summary: We propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign.
Our best model reaches 72.4 NDS on the nuScenes test leaderboard, achieving new state-of-the-art results.
- Score: 17.526914782562528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Point clouds and RGB images are two general perceptual sources in
autonomous driving. The former can provide accurate localization of objects,
and the latter is denser and richer in semantic information. Recently,
AutoAlign presented a learnable paradigm for combining these two modalities in
3D object detection. However, it suffers from the high computational cost
introduced by its global attention. To solve this problem, we propose the
Cross-Domain DeformCAFA module in this work. It attends to sparse, learnable
sampling points for cross-modal relational modeling, which improves tolerance
to calibration error and greatly speeds up feature aggregation across
modalities. To overcome the complexity of GT-AUG under multi-modal settings,
we design a simple yet effective cross-modal augmentation strategy based on
convex combinations of image patches, guided by their depth information.
Moreover, through a novel image-level dropout training scheme, our model can
run inference in a dynamic manner. Building on these components, we propose
AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built
on top of AutoAlign. Extensive experiments on the nuScenes benchmark
demonstrate the effectiveness and efficiency of AutoAlignV2. Notably, our
best model reaches 72.4 NDS on the nuScenes test leaderboard, achieving new
state-of-the-art results among all published multi-modal 3D object detectors.
Code will be available at
https://github.com/zehuichen123/AutoAlignV2.
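To make the main idea concrete, here is a minimal PyTorch sketch of deformable cross-modal feature aggregation in the spirit of the Cross-Domain DeformCAFA module. This is an illustration under assumptions, not the authors' implementation: the class name, tensor shapes, offset scale, and single-image setting are all invented for clarity; the official code is at the repository linked above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableCrossModalFusion(nn.Module):
    # Each 3D query (a non-empty voxel/point feature) attends to K sparse,
    # learnable sampling points around its projected 2D location on the image
    # feature map, instead of attending densely to every pixel.
    def __init__(self, pts_dim, img_dim, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(pts_dim, num_points * 2)  # K 2D offsets
        self.weight_head = nn.Linear(pts_dim, num_points)      # K attn weights
        self.out_proj = nn.Linear(img_dim, pts_dim)

    def forward(self, pts_feats, ref_uv, img_feats):
        # pts_feats: (N, pts_dim)       features of N non-empty voxels
        # ref_uv:    (N, 2)             projected image coords in [-1, 1]
        # img_feats: (1, img_dim, H, W) image feature map
        n = pts_feats.shape[0]
        # Learned offsets shift the sampling points around the projection,
        # which also buys some tolerance to calibration error.
        offsets = self.offset_head(pts_feats).view(n, self.num_points, 2)
        sample_uv = ref_uv[:, None, :] + 0.1 * offsets.tanh()   # (N, K, 2)
        # Bilinear sampling at the K sparse locations; out-of-image samples
        # fall back to zeros under grid_sample's default padding.
        grid = sample_uv.view(1, n, self.num_points, 2)
        sampled = F.grid_sample(img_feats, grid, align_corners=False)
        sampled = sampled.squeeze(0).permute(1, 2, 0)           # (N, K, img_dim)
        # Attention-weighted aggregation over the K points, then a residual.
        weights = self.weight_head(pts_feats).softmax(dim=-1)   # (N, K)
        fused = (weights[:, :, None] * sampled).sum(dim=1)      # (N, img_dim)
        return pts_feats + self.out_proj(fused)

Because each query touches only K points rather than the whole H x W feature map, aggregation cost drops from roughly O(N * H * W) to O(N * K), which is the source of the claimed speed-up over global attention.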
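The depth-aware cross-modal augmentation can be sketched in the same spirit. The abstract only states that pasted image patches are combined convexly given their depth; the far-to-near compositing order and the fixed blending weight below are assumptions made for illustration.

import numpy as np

def paste_patches_by_depth(image, patches, alpha=0.7):
    # Depth-aware GT-AUG on the image side: composite pasted object patches
    # far-to-near, each paste being a convex combination of the patch and
    # the current canvas, so nearer objects dominate where boxes overlap.
    #   image:   (H, W, 3) float array
    #   patches: list of (patch, (y, x), depth); patch is (h, w, 3) and is
    #            assumed to fit inside the image at (y, x)
    canvas = image.copy()
    for patch, (y, x), depth in sorted(patches, key=lambda p: -p[2]):
        h, w = patch.shape[:2]
        region = canvas[y:y + h, x:x + w]
        # Convex combination: alpha * patch + (1 - alpha) * background.
        canvas[y:y + h, x:x + w] = alpha * patch + (1 - alpha) * region
    return canvas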
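Likewise, the image-level dropout scheme that enables dynamic inference can be approximated as below; the drop probability and the zeroing strategy are assumptions, since the abstract does not specify them.

import torch

def image_level_dropout(img_feats, p=0.3, training=True):
    # During training, drop the entire image modality with probability p so
    # the detector also learns to predict from LiDAR alone; at test time it
    # can then run with or without camera input ("dynamic" inference).
    if training and torch.rand(1).item() < p:
        return torch.zeros_like(img_feats)
    return img_feats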
Related papers
- An Efficient Wide-Range Pseudo-3D Vehicle Detection Using A Single
Camera [10.573423265001706]
This paper proposes a novel wide-range Pseudo-3D Vehicle Detection method based on images from a single camera.
To detect pseudo-3D objects, our model adopts specifically designed detection heads.
A joint constraint loss combining both the object box and SPL is designed for model training, improving the efficiency, stability, and prediction accuracy of the model.
arXiv Detail & Related papers (2023-09-15T12:50:09Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- GOOD: General Optimization-based Fusion for 3D Object Detection via LiDAR-Camera Object Candidates [10.534984939225014]
3D object detection serves as the core basis of perception tasks in autonomous driving.
GOOD is a general optimization-based fusion framework that achieves satisfactory detection without training additional models.
Experiments on both the nuScenes and KITTI datasets show that GOOD outperforms PointPillars by 9.1% in mAP.
arXiv Detail & Related papers (2023-03-17T07:05:04Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection [83.18142309597984]
Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving.
We develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods.
arXiv Detail & Related papers (2022-03-15T18:46:06Z)
- AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection [46.03951171790736]
We propose AutoAlign, an automatic feature fusion strategy for 3D object detection.
We show that our approach can lead to 2.3 mAP and 7.0 mAP improvements on the KITTI and nuScenes datasets, respectively.
arXiv Detail & Related papers (2022-01-17T16:08:57Z)
- SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
- PerMO: Perceiving More at Once from a Single Image for Autonomous Driving [76.35684439949094]
We present a novel approach to detect, segment, and reconstruct complete textured 3D models of vehicles from a single image.
Our approach combines the strengths of deep learning and the elegance of traditional techniques.
We have integrated these algorithms with an autonomous driving system.
arXiv Detail & Related papers (2020-07-16T05:02:45Z)