Towards Transferable Multi-modal Perception Representation Learning for
Autonomy: NeRF-Supervised Masked AutoEncoder
- URL: http://arxiv.org/abs/2311.13750v2
- Date: Wed, 6 Dec 2023 05:20:16 GMT
- Title: Towards Transferable Multi-modal Perception Representation Learning for
Autonomy: NeRF-Supervised Masked AutoEncoder
- Authors: Xiaohao Xu
- Abstract summary: This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning.
We show that the representation learned via NeRF-Supervised Masked AutoEncoder (NS-MAE) transfers well to diverse multi-modal and single-modal (camera-only and LiDAR-only) perception models.
We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This work proposes a unified self-supervised pre-training framework for
transferable multi-modal perception representation learning via masked
multi-modal reconstruction in Neural Radiance Field (NeRF), namely
NeRF-Supervised Masked AutoEncoder (NS-MAE). Specifically, conditioned on
certain view directions and locations, multi-modal embeddings extracted from
corrupted multi-modal input signals, i.e., Lidar point clouds and images, are
rendered into projected multi-modal feature maps via neural rendering. Then,
original multi-modal signals serve as reconstruction targets for the rendered
multi-modal feature maps to enable self-supervised representation learning.
Extensive experiments show that the representation learned via NS-MAE shows
promising transferability for diverse multi-modal and single-modal (camera-only
and Lidar-only) perception models on diverse 3D perception downstream tasks (3D
object detection and BEV map segmentation) with diverse amounts of fine-tuning
labeled data. Moreover, we empirically find that NS-MAE enjoys the synergy of
both the mechanism of masked autoencoder and neural radiance field. We hope
this study can inspire exploration of more general multi-modal representation
learning for autonomous agents.
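To make the described pipeline concrete, the sketch below illustrates the pre-training objective as the abstract outlines it: masked (corrupted) camera and LiDAR inputs are encoded, the embeddings are rendered into view-conditioned feature maps, and the original signals serve as reconstruction targets. This is a minimal PyTorch sketch under strong simplifying assumptions; the module names (TinyEncoder, RenderHead, NSMAESketch), the BEV rasterization of the point cloud, the degenerate single-sample rendering, and all hyperparameters are illustrative and are not taken from the authors' implementation.

```python
# Minimal conceptual sketch of the NS-MAE pre-training objective described above.
# All module names, shapes, and the simplified rendering step are illustrative
# assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a modality encoder operating on masked (corrupted) input."""
    def __init__(self, in_ch, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


class RenderHead(nn.Module):
    """Maps view-conditioned embeddings to density and feature values,
    a simplified stand-in for the neural-rendering step."""
    def __init__(self, dim, out_ch):
        super().__init__()
        self.density = nn.Conv2d(dim + 3, 1, 1)      # +3 channels for the view direction
        self.feature = nn.Conv2d(dim + 3, out_ch, 1)

    def forward(self, emb, view_dir):
        # Broadcast the (B, 3) view direction over the spatial grid.
        d = view_dir[:, :, None, None].expand(-1, -1, *emb.shape[2:])
        x = torch.cat([emb, d], dim=1)
        sigma = F.softplus(self.density(x))          # non-negative density
        feat = self.feature(x)
        # Alpha-composite a single "depth" sample (degenerate volume rendering).
        alpha = 1.0 - torch.exp(-sigma)
        return alpha * feat


class NSMAESketch(nn.Module):
    """End-to-end sketch: corrupt -> encode -> render -> reconstruct."""
    def __init__(self, dim=64):
        super().__init__()
        self.img_enc, self.pts_enc = TinyEncoder(3, dim), TinyEncoder(1, dim)
        self.img_head, self.pts_head = RenderHead(dim, 3), RenderHead(dim, 1)

    def forward(self, image, lidar_bev, mask_ratio=0.75):
        # Randomly zero out locations of each modality (a crude per-pixel
        # stand-in for the patch masking used by MAE-style methods).
        img_mask = (torch.rand_like(image[:, :1]) > mask_ratio).float()
        pts_mask = (torch.rand_like(lidar_bev[:, :1]) > mask_ratio).float()
        # A random unit view direction conditions the rendering heads.
        view_dir = F.normalize(torch.randn(image.shape[0], 3, device=image.device), dim=1)
        img_pred = self.img_head(self.img_enc(image * img_mask), view_dir)
        pts_pred = self.pts_head(self.pts_enc(lidar_bev * pts_mask), view_dir)
        # Original (uncorrupted) signals are the reconstruction targets.
        return F.mse_loss(img_pred, image) + F.mse_loss(pts_pred, lidar_bev)


if __name__ == "__main__":
    model = NSMAESketch()
    loss = model(torch.randn(2, 3, 32, 32), torch.randn(2, 1, 32, 32))
    print(float(loss))  # scalar reconstruction loss for a toy batch
```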
Related papers
- Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception [17.11366229887873]
We introduce a unified pre-training strategy, NeRF-Supervised Masked AutoEncoder (NS-MAE).
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality.
arXiv Detail & Related papers (2024-05-28T08:13:49Z)
- 4M: Massively Multimodal Masked Modeling [20.69496647914175]
Current machine learning models for vision are often highly specialized and limited to a single modality and task.
Recent large language models exhibit a wide range of capabilities, hinting at a possibility for similarly versatile models in computer vision.
We propose a multimodal training scheme called 4M for training versatile and scalable foundation models for vision tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z)
- UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving [47.590099762244535]
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks.
This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving.
To combine the semantics of images with the geometric structure of LiDAR point clouds, we propose UniM$^2$AE.
arXiv Detail & Related papers (2023-08-21T02:13:40Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
- SimDistill: Simulated Multi-modal Distillation for BEV 3D Object Detection [56.24700754048067]
Multi-view camera-based 3D object detection has become popular due to its low cost, but accurately inferring 3D geometry solely from camera data remains challenging.
We propose a Simulated multi-modal Distillation (SimDistill) method by carefully crafting the model architecture and distillation strategy.
Our SimDistill can learn better feature representations for 3D object detection while maintaining a cost-effective camera-only deployment.
arXiv Detail & Related papers (2023-03-29T16:08:59Z)
- Mirror U-Net: Marrying Multimodal Fission with Multi-task Learning for Semantic Segmentation in Medical Imaging [19.011295977183835]
We propose Mirror U-Net, which replaces traditional fusion methods with multimodal fission.
Mirror U-Net assigns a task tailored to each modality to reinforce unimodal features while preserving multimodal features in the shared representation.
We evaluate Mirror U-Net on the AutoPET PET/CT and on the multimodal MSD BrainTumor datasets, demonstrating its effectiveness in multimodal segmentation and achieving state-of-the-art performance on both datasets.
arXiv Detail & Related papers (2023-03-13T13:57:29Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.