Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception
- URL: http://arxiv.org/abs/2405.17942v2
- Date: Fri, 11 Oct 2024 22:01:56 GMT
- Title: Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception
- Authors: Xiaohao Xu, Ye Li, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang,
- Abstract summary: We introduce a unified pretraining strategy, NeRF-Supervised Masked Auto (NS-MAE)
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality.
- Score: 17.11366229887873
- License:
- Abstract: Constructing large-scale labeled datasets for multi-modal perception model training in autonomous driving presents significant challenges. This has motivated the development of self-supervised pretraining strategies. However, existing pretraining methods mainly employ distinct approaches for each modality. In contrast, we focus on LiDAR-Camera 3D perception models and introduce a unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation. NS-MAE leverages NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data. Specifically, embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on view directions and locations. Then, these embeddings are rendered into multi-modal feature maps from two crucial viewpoints for 3D driving perception: perspective and bird's-eye views. The original uncorrupted data serve as reconstruction targets for self-supervised learning. Extensive experiments demonstrate the superior transferability of NS-MAE across various 3D perception tasks under different fine-tuning settings. Notably, NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting. Our code is publicly available at https://github.com/Xiaohao-Xu/Unified-Pretrain-AD/ .
Related papers
- FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection [33.225938984092274]
We propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies.
We also design two Point Cloud Intensification ( PCI) strategies to compensate for the sparsity of point clouds.
We develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features.
arXiv Detail & Related papers (2024-07-14T09:39:44Z) - Exploring the Untouched Sweeps for Conflict-Aware 3D Segmentation Pretraining [41.145598142457686]
LiDAR-camera 3D representation pretraining has shown significant promise for 3D perception tasks and related applications.
We propose a novel Vision-Foundation-Model-driven sample exploring module to meticulously select LiDAR-Image pairs from unexplored frames.
Our method consistently outperforms existing state-of-the-art pretraining frameworks across three major public autonomous driving datasets.
arXiv Detail & Related papers (2024-07-10T08:46:29Z) - OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose an OccNeRF method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z) - Towards Transferable Multi-modal Perception Representation Learning for
Autonomy: NeRF-Supervised Masked AutoEncoder [1.90365714903665]
This work proposes a unified self-supervised pre-training framework for transferable multi-modal perception representation learning.
We show that the representation learned via NeRF-Supervised Masked AutoEncoder (NS-MAE) shows promising transferability for diverse multi-modal and single-modal (camera-only and Lidar-only) perception models.
We hope this study can inspire exploration of more general multi-modal representation learning for autonomous agents.
arXiv Detail & Related papers (2023-11-23T00:53:11Z) - Leveraging Large-Scale Pretrained Vision Foundation Models for
Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z) - SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations [76.45009891152178]
Pretraining-finetuning approach can alleviate the labeling burden by fine-tuning a pre-trained backbone across various downstream datasets as well as tasks.
We show, for the first time, that general representations learning can be achieved through the task of occupancy prediction.
Our findings will facilitate the understanding of LiDAR points and pave the way for future advancements in LiDAR pre-training.
arXiv Detail & Related papers (2023-09-19T11:13:01Z) - CALICO: Self-Supervised Camera-LiDAR Contrastive Pre-training for BEV
Perception [32.91233926771015]
CALICO is a novel framework that applies contrastive objectives to both LiDAR and camera backbones.
Our framework can be tailored to different backbones and heads, positioning it as a promising approach for multimodal BEV perception.
arXiv Detail & Related papers (2023-06-01T05:06:56Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z) - BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder learning feature representation.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.