Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point
Clouds with Masked Occupancy Autoencoders
- URL: http://arxiv.org/abs/2206.09900v7
- Date: Mon, 9 Oct 2023 12:34:02 GMT
- Title: Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point
Clouds with Masked Occupancy Autoencoders
- Authors: Chen Min and Xinli Xu and Dawei Zhao and Liang Xiao and Yiming Nie and
Bin Dai
- Abstract summary: We propose a solution to reduce the dependence on labelled 3D training data by leveraging pre-training on large-scale unlabeled outdoor LiDAR point clouds.
Our approach introduces a new self-supervised masked occupancy pre-training method called Occupancy-MAE.
For 3D object detection, Occupancy-MAE reduces the labelled data required for car detection on the KITTI dataset by half.
For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU.
- Score: 13.119676419877244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current perception models in autonomous driving heavily rely on large-scale
labelled 3D data, which is both costly and time-consuming to annotate. This
work proposes a solution to reduce the dependence on labelled 3D training data
by leveraging pre-training on large-scale unlabeled outdoor LiDAR point clouds
using masked autoencoders (MAE). While existing masked point autoencoding
methods mainly focus on small-scale indoor point clouds or pillar-based
large-scale outdoor LiDAR data, our approach introduces a new self-supervised
masked occupancy pre-training method called Occupancy-MAE, specifically
designed for voxel-based large-scale outdoor LiDAR point clouds. Occupancy-MAE
takes advantage of the voxel occupancy structure of outdoor LiDAR point
clouds, which grows gradually sparser with distance from the sensor, and
incorporates a range-aware random masking strategy and a
pretext task of occupancy prediction. By randomly masking voxels based on their
distance to the LiDAR and predicting the masked occupancy structure of the
entire 3D surrounding scene, Occupancy-MAE encourages the extraction of
high-level semantic information to reconstruct the masked voxels using only a
small number of visible voxels. Extensive experiments demonstrate the
effectiveness of Occupancy-MAE across several downstream tasks. For 3D object
detection, Occupancy-MAE reduces the labelled data required for car detection
on the KITTI dataset by half and improves small object detection by
approximately 2% in AP on the Waymo dataset. For 3D semantic segmentation,
Occupancy-MAE outperforms training from scratch by around 2% in mIoU. For
multi-object tracking, Occupancy-MAE improves over training from scratch by
approximately 1% in AMOTA and AMOTP. Code is publicly available at
https://github.com/chaytonmin/Occupancy-MAE.
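Two of the components named in the abstract, the range-aware random masking and the occupancy-prediction pretext task, lend themselves to a compact illustration. The following NumPy sketch uses an assumed voxel size, grid extent, and per-ring masking schedule; none of these values come from the paper, whose actual implementation is in the repository linked above.

```python
import numpy as np

VOXEL = 0.5                                  # assumed voxel edge length (m)
EXTENT = np.array([[-50., 50.],              # assumed x/y/z grid bounds (m)
                   [-50., 50.],
                   [-3., 3.]])

def voxelize(points):
    """Turn an (N, 3) point cloud into a dense binary occupancy grid."""
    lo = EXTENT[:, 0]
    dims = np.ceil((EXTENT[:, 1] - lo) / VOXEL).astype(int)
    idx = np.floor((points - lo) / VOXEL).astype(int)
    keep = np.all((idx >= 0) & (idx < dims), axis=1)
    grid = np.zeros(dims, dtype=bool)
    grid[tuple(idx[keep].T)] = True
    return grid

def range_aware_mask(grid, ratios=((20., 0.9), (40., 0.7), (np.inf, 0.5)),
                     seed=0):
    """Mask occupied voxels with a probability that depends on range.

    `ratios` maps a horizontal-range bound (m) to a masking ratio; this
    decreasing schedule is a guess at the idea, not the paper's values.
    """
    rng = np.random.default_rng(seed)
    # horizontal distance of each voxel centre from the sensor at (0, 0)
    axes = [np.arange(d) for d in grid.shape]
    centres = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1) + 0.5
    centres = centres * VOXEL + EXTENT[:, 0]
    dist = np.linalg.norm(centres[..., :2], axis=-1)
    mask, prev = np.zeros_like(grid), 0.0
    for bound, ratio in ratios:
        ring = grid & (dist >= prev) & (dist < bound)
        mask |= ring & (rng.random(grid.shape) < ratio)
        prev = bound
    return mask

# Pretext task: from the visible voxels alone, predict the binary occupancy
# of the whole grid; the target would be compared against decoder logits
# with a binary cross-entropy loss.
points = np.random.uniform([-50, -50, -3], [50, 50, 3], size=(20000, 3))
grid = voxelize(points)
mask = range_aware_mask(grid)
visible = grid & ~mask             # encoder input
target = grid.astype(np.float32)   # decoder target
print(f"occupied={grid.sum()}, masked={mask.sum()}, visible={visible.sum()}")
```

The decreasing ratios reflect the sparsity argument from the abstract: distant voxels are already sparse, so masking them as heavily as nearby ones would leave too little signal to reconstruct; the paper's exact schedule may differ.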
Related papers
- Triple Point Masking [49.39218611030084]
Existing 3D mask learning methods encounter performance bottlenecks under limited data.
We introduce a triple point masking scheme, named TPM, which serves as a scalable framework for pre-training of masked autoencoders.
Extensive experiments show that the four baselines equipped with the proposed TPM achieve comprehensive performance improvements on various downstream tasks.
arXiv Detail & Related papers (2024-09-26T05:33:30Z)
- Sense Less, Generate More: Pre-training LiDAR Perception with Masked Autoencoders for Ultra-Efficient 3D Sensing [0.6340101348986665]
We propose a frugal LiDAR perception dataflow that generates, rather than senses, parts of the environment that are either predictable from extensive prior training or of limited consequence to overall prediction accuracy.
The generative pre-training strategy proposed for this purpose, radially masked autoencoding (R-MAE), can also be readily implemented in a typical LiDAR system by selectively activating and controlling laser power for randomly generated angular regions during on-field operation (a minimal sketch of this sector masking appears after the list).
arXiv Detail & Related papers (2024-06-12T03:02:54Z)
- Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception [17.11366229887873]
We introduce a unified pre-training strategy, NeRF-Supervised Masked AutoEncoder (NS-MAE).
NS-MAE exploits NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data.
Results: NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality.
arXiv Detail & Related papers (2024-05-28T08:13:49Z)
- OccNeRF: Advancing 3D Occupancy Prediction in LiDAR-Free Environments [77.0399450848749]
We propose OccNeRF, a method for training occupancy networks without 3D supervision.
We parameterize the reconstructed occupancy fields and reorganize the sampling strategy to align with the cameras' infinite perceptive range.
For semantic occupancy prediction, we design several strategies to polish the prompts and filter the outputs of a pretrained open-vocabulary 2D segmentation model.
arXiv Detail & Related papers (2023-12-14T18:58:52Z)
- MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds [13.426810473131642]
Masked AutoEncoder for LiDAR point clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both the encoder and decoder during reconstruction.
In a novel reconstruction approach, MAELi distinguishes between empty and occluded space.
Trained on single frames only and without any ground truth, MAELi thereby acquires an understanding of the underlying 3D scene geometry and semantics.
arXiv Detail & Related papers (2022-12-14T13:10:27Z)
- BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios [51.285561119993105]
We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder in learning feature representations.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
arXiv Detail & Related papers (2022-12-12T08:15:03Z)
- Masked Autoencoders for Self-Supervised Learning on Automotive Point Clouds [2.8544513613730205]
Masked autoencoding has become a successful pre-training paradigm for Transformer models for text, images, and recently, point clouds.
We propose VoxelMAE, a simple masked autoencoding pretraining scheme designed for voxel representations.
Our method improves 3D object detection performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset.
arXiv Detail & Related papers (2022-07-01T16:31:45Z)
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training [56.81809311892475]
Masked Autoencoders (MAE) have shown great potential in self-supervised pre-training for language and 2D image transformers.
We propose Point-M2AE, a strong Multi-scale MAE pre-training framework for hierarchical self-supervised learning of 3D point clouds.
arXiv Detail & Related papers (2022-05-28T11:22:53Z)
- Lifting 2D Object Locations to 3D by Discounting LiDAR Outliers across Objects and Views [70.1586005070678]
We present a system for automatically converting 2D mask object predictions and raw LiDAR point clouds into full 3D bounding boxes of objects.
Our method significantly outperforms previous work, even though those methods use far more complex pipelines, 3D models, and additional human-annotated external sources of prior information.
arXiv Detail & Related papers (2021-09-16T13:01:13Z)
- InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
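Of the masking strategies in this list, R-MAE's radial masking is the most sensor-oriented: instead of hiding individual voxels, it hides whole azimuth sectors, which corresponds directly to gating the laser for those angular regions. A minimal sketch with an assumed sector count and masking ratio, neither taken from the R-MAE paper:

```python
import numpy as np

def radial_mask(points, num_sectors=64, mask_ratio=0.5, seed=0):
    """Hide all points inside randomly chosen azimuth sectors.

    Sector count and masking ratio are illustrative assumptions; the
    actual R-MAE scheme also ties sector selection to laser-power control.
    """
    rng = np.random.default_rng(seed)
    azimuth = np.arctan2(points[:, 1], points[:, 0])           # [-pi, pi]
    sector = ((azimuth + np.pi) / (2 * np.pi) * num_sectors).astype(int)
    sector %= num_sectors                                      # wrap pi edge
    hidden = rng.choice(num_sectors, size=int(mask_ratio * num_sectors),
                        replace=False)
    keep = ~np.isin(sector, hidden)
    return points[keep], points[~keep]

pts = np.random.uniform(-50., 50., size=(10000, 3))
visible, masked = radial_mask(pts)
print(f"visible={len(visible)}, masked={len(masked)}")
```

Because whole sectors vanish at once, the encoder must infer structure across large contiguous gaps rather than interpolate between scattered missing voxels, which is what makes the scheme compatible with physically switching parts of the sensor off.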