UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D
Representation for 3D Perception in Autonomous Driving
- URL: http://arxiv.org/abs/2308.10421v2
- Date: Wed, 30 Aug 2023 02:32:08 GMT
- Authors: Jian Zou, Tianyu Huang, Guanglei Yang, Zhenhua Guo, Wangmeng Zuo
- Abstract summary: Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To marry the semantics inherent in images with the geometric structure of LiDAR point clouds, UniM$^2$AE is proposed.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked Autoencoders (MAE) play a pivotal role in learning potent
representations, delivering outstanding results across various 3D perception
tasks essential for autonomous driving. In real-world driving scenarios, it is
commonplace to deploy multiple sensors for comprehensive environment
perception. While integrating multi-modal features from these sensors can
produce rich and powerful features, there is a noticeable gap in MAE methods
addressing this integration. This research delves into multi-modal Masked
Autoencoders tailored for a unified representation space in autonomous driving,
aiming to pioneer a more efficient fusion of two distinct modalities. To
intricately marry the semantics inherent in images with the geometric
intricacies of LiDAR point clouds, the UniM$^2$AE is proposed. This model
stands as a potent yet straightforward, multi-modal self-supervised
pre-training framework, mainly consisting of two designs. First, it projects
the features from both modalities into a cohesive 3D volume space, ingeniously
expanded from the bird's eye view (BEV) to include the height dimension. The
extension makes it possible to back-project the informative features, obtained
by fusing features from both modalities, into their native modalities to
reconstruct the multiple masked inputs. Second, the Multi-modal 3D Interactive
Module (MMIM) is introduced to facilitate efficient inter-modal
interaction. Extensive experiments conducted on the nuScenes
Dataset attest to the efficacy of UniM$^2$AE, indicating enhancements in 3D
object detection and BEV map segmentation by 1.2\% (NDS) and 6.5\% (mIoU),
respectively. Code is available at https://github.com/hollow-503/UniM2AE.
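The first design, projecting features from both modalities into a shared 3D volume (BEV expanded with a height dimension) and back-projecting the fused features to each modality, can be sketched as a voxel scatter/gather in plain Python. This is a minimal illustration of the idea only: the function names, the per-voxel averaging rule, and the grid parameters are illustrative assumptions, not the authors' implementation.

```python
def scatter_to_volume(points, feats, voxel_size, grid_shape):
    """Accumulate per-point features into a sparse 3D volume (BEV + height).

    points: iterable of (x, y, z) coordinates in metres
    feats:  per-point feature vectors, same length as points
    Returns a dict mapping voxel index -> averaged feature vector.
    """
    volume, counts = {}, {}
    for (x, y, z), f in zip(points, feats):
        idx = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        # drop points that fall outside the fixed grid
        if not all(0 <= i < s for i, s in zip(idx, grid_shape)):
            continue
        if idx not in volume:
            volume[idx] = [0.0] * len(f)
            counts[idx] = 0
        volume[idx] = [a + b for a, b in zip(volume[idx], f)]
        counts[idx] += 1
    # average accumulated features per occupied voxel
    return {k: [v / counts[k] for v in vec] for k, vec in volume.items()}


def back_project(points, volume, voxel_size):
    """Look up each point's fused voxel feature (the back-projection step)."""
    out = []
    for (x, y, z) in points:
        idx = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        out.append(volume.get(idx))  # None if the voxel is empty
    return out
```

In the paper's setting, image features lifted to 3D and LiDAR features would both be scattered into the same grid before fusion; here a single point stream stands in for both, purely to show the index arithmetic.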
Related papers
- Towards Transferable Multi-modal Perception Representation Learning for Autonomy: NeRF-Supervised Masked AutoEncoder [1.90365714903665]
This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning.
We show that the representation learned via NeRF-Supervised Masked AutoEncoder (NS-MAE) shows promising transferability for diverse multi-modal and single-modal (camera-only and LiDAR-only) perception models.
We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
arXiv Detail & Related papers (2023-11-23T00:53:11Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving [15.36416000750147]
We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion.
MSeg3D still shows robustness and improves the LiDAR-only baseline.
arXiv Detail & Related papers (2023-03-15T13:13:03Z)
- PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities.
In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world.
We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- HUM3DIL: Semi-supervised Multi-modal 3D Human Pose Estimation for Autonomous Driving [95.42203932627102]
3D human pose estimation is an emerging technology, which can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians.
Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages.
Our method efficiently makes use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin.
arXiv Detail & Related papers (2022-12-15T11:15:14Z)
- AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection [17.526914782562528]
We propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign.
Our best model reaches 72.4 NDS on nuScenes test leaderboard, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-07-21T06:17:23Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [116.6111047218081]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.