BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios
- URL: http://arxiv.org/abs/2212.05758v2
- Date: Sun, 21 Jan 2024 03:51:31 GMT
- Title: BEV-MAE: Bird's Eye View Masked Autoencoders for Point Cloud
Pre-training in Autonomous Driving Scenarios
- Authors: Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, Ming-Hsuan Yang
- Abstract summary: We present BEV-MAE, an efficient masked autoencoder pre-training framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy to guide the 3D encoder in learning feature representations.
We introduce a learnable point token to maintain a consistent receptive field size of the 3D encoder.
- Score: 51.285561119993105
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Existing LiDAR-based 3D object detection methods for autonomous driving
scenarios mainly adopt the training-from-scratch paradigm. Unfortunately, this
paradigm heavily relies on large-scale labeled data, whose collection can be
expensive and time-consuming. Self-supervised pre-training is an effective and
desirable way to alleviate this dependence on extensive annotated data. In this
work, we present BEV-MAE, an efficient masked autoencoder pre-training
framework for LiDAR-based 3D object detection in autonomous driving.
Specifically, we propose a bird's eye view (BEV) guided masking strategy that
guides the 3D encoder to learn feature representations from a BEV perspective
and avoids complex decoder designs during pre-training. Furthermore, we
introduce a learnable point token to keep the receptive field size of the 3D
encoder consistent between pre-training on masked point cloud inputs and
fine-tuning. Exploiting a property of outdoor point clouds in autonomous
driving scenarios, namely that the point clouds of distant objects are sparser,
we propose point density prediction to enable the 3D encoder to learn location
information, which is essential for object detection. Experimental results show
that BEV-MAE surpasses prior state-of-the-art self-supervised methods and
achieves favorable pre-training efficiency. Furthermore, based on
TransFusion-L, BEV-MAE achieves new
state-of-the-art LiDAR-based 3D object detection results, with 73.6 NDS and
69.6 mAP on the nuScenes benchmark. The source code will be released at
https://github.com/VDIGPKU/BEV-MAE
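
To make the masking and density-prediction ideas above concrete, the following is a minimal sketch of how a BEV-guided mask and per-cell point-density targets could be computed. It is written in PyTorch purely for illustration; the grid size, mask ratio, and all function and variable names (e.g. bev_guided_mask, density_targets) are assumptions and not the authors' released implementation, which will be available at the repository above.

```python
# Minimal sketch of BEV-guided masking and point-density targets, based only on
# the abstract above. Grid size, mask ratio, and all names are illustrative
# assumptions, not the authors' implementation (see the BEV-MAE repository).
import torch

def bev_guided_mask(points, pc_range=(-50.0, 50.0), cell=2.0, mask_ratio=0.7):
    """Mask occupied BEV cells; return per-point mask flags and density targets.

    points: (N, 3+) LiDAR points; only x and y are used to build the BEV grid.
    """
    grid = int((pc_range[1] - pc_range[0]) / cell)
    ix = ((points[:, 0] - pc_range[0]) / cell).long().clamp(0, grid - 1)
    iy = ((points[:, 1] - pc_range[0]) / cell).long().clamp(0, grid - 1)
    cell_id = ix * grid + iy                              # BEV cell index per point

    occupied = torch.unique(cell_id)                      # cells that contain points
    n_mask = int(mask_ratio * occupied.numel())
    masked_cells = occupied[torch.randperm(occupied.numel())[:n_mask]]

    point_is_masked = torch.isin(cell_id, masked_cells)   # points in masked BEV cells

    # Point-density target per masked cell: fraction of the sweep's points that
    # fall in that cell. Distant cells naturally get lower densities, so
    # predicting density encodes location information.
    counts = torch.bincount(cell_id, minlength=grid * grid).float()
    density_targets = counts[masked_cells] / points.shape[0]
    return point_is_masked, masked_cells, density_targets

# Usage: drop the masked points from the encoder input and place a shared
# learnable point token at the masked locations, so the encoder's receptive
# field stays consistent with fine-tuning on full point clouds.
points = torch.rand(10000, 4) * 100.0 - 50.0              # fake LiDAR sweep (x, y, z, intensity)
point_token = torch.nn.Parameter(torch.zeros(4))          # learnable placeholder feature
masked, cells, targets = bev_guided_mask(points)
visible = points[~masked]
print(visible.shape, cells.shape, targets.shape)
```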
Related papers
- End-to-End 3D Object Detection using LiDAR Point Cloud [0.0]
We present an approach wherein, using a novel encoding of the LiDAR point cloud, we infer the locations of different object classes near the autonomous vehicle.
The output consists of predictions of the location and orientation of objects in the scene, in the form of 3D bounding boxes and labels of scene objects.
arXiv Detail & Related papers (2023-12-24T00:52:14Z) - FocalFormer3D : Focusing on Hard Instance for 3D Object Detection [97.56185033488168]
False negatives (FN) in 3D object detection can lead to potentially dangerous situations in autonomous driving.
In this work, we propose Hard Instance Probing (HIP), a general pipeline that identifies FN in a multi-stage manner.
We instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects.
arXiv Detail & Related papers (2023-08-08T20:06:12Z) - OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z) - Weakly Supervised Monocular 3D Object Detection using Multi-View
Projection and Direction Consistency [78.76508318592552]
Monocular 3D object detection has become a mainstream approach in autonomous driving because it is easy to deploy.
Most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase.
We propose a new weakly supervised monocular 3D object detection method, which can train the model with only 2D labels marked on images.
arXiv Detail & Related papers (2023-03-15T15:14:00Z) - OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for
Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z) - MAELi: Masked Autoencoder for Large-Scale LiDAR Point Clouds [13.426810473131642]
Masked AutoEncoder for LiDAR point clouds (MAELi) intuitively leverages the sparsity of LiDAR point clouds in both the encoder and decoder during reconstruction.
In a novel reconstruction approach, MAELi distinguishes between empty and occluded space.
As a result, trained on single frames only and without any ground truth, MAELi acquires an understanding of the underlying 3D scene geometry and semantics.
arXiv Detail & Related papers (2022-12-14T13:10:27Z) - Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point
Clouds with Masked Occupancy Autoencoders [13.119676419877244]
We propose a solution to reduce the dependence on labelled 3D training data by leveraging pre-training on large-scale unlabeled outdoor LiDAR point clouds.
Our approach introduces a new self-supervised masked occupancy pre-training method called Occupancy-MAE.
For 3D object detection, Occupancy-MAE reduces the labelled data required for car detection on the KITTI dataset by half.
For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU.
arXiv Detail & Related papers (2022-06-20T17:15:50Z) - BEVerse: Unified Perception and Prediction in Birds-Eye-View for
Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z) - RAANet: Range-Aware Attention Network for LiDAR-based 3D Object
Detection with Auxiliary Density Level Estimation [11.180128679075716]
Range-Aware Attention Network (RAANet) is developed for 3D object detection from LiDAR data for autonomous driving.
RAANet extracts more powerful BEV features and generates superior 3D object detections.
Experiments on the nuScenes dataset demonstrate that our proposed approach outperforms the state-of-the-art methods for LiDAR-based 3D object detection.
arXiv Detail & Related papers (2021-11-18T04:20:13Z) - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic
Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.