Geometric-aware Pretraining for Vision-centric 3D Object Detection
- URL: http://arxiv.org/abs/2304.03105v2
- Date: Fri, 7 Apr 2023 16:31:07 GMT
- Title: Geometric-aware Pretraining for Vision-centric 3D Object Detection
- Authors: Linyan Huang, Huijie Wang, Jia Zeng, Shengchuan Zhang, Liujuan Cao,
Junchi Yan, Hongyang Li
- Abstract summary: We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
- Score: 77.7979088689944
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-camera 3D object detection for autonomous driving is a challenging
problem that has garnered notable attention from both academia and industry. A
key obstacle for vision-based techniques is the precise extraction of
geometry-aware features from RGB images. Recent approaches have utilized
geometric-aware image backbones pretrained on depth-relevant tasks to acquire
spatial information. However, these approaches overlook the critical aspect of
view transformation, resulting in inadequate performance due to the
misalignment of spatial knowledge between the image backbone and view
transformation. To address this issue, we propose a novel geometric-aware
pretraining framework called GAPretrain. Our approach incorporates spatial and
structural cues into camera networks by employing the geometry-rich LiDAR
modality as guidance during the pretraining phase. Transferring
modality-specific attributes across modalities is non-trivial, but we bridge
this gap by using a unified bird's-eye-view (BEV) representation and
structural hints derived from LiDAR point clouds to facilitate the
pretraining process.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to
multiple state-of-the-art detectors. Our experiments demonstrate the
effectiveness and generalization ability of the proposed method. We achieve
46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with
a gain of 2.7 and 2.1 points, respectively. We also conduct experiments on
various image backbones and view transformations to validate the efficacy of
our approach. Code will be released at
https://github.com/OpenDriveLab/BEVPerception-Survey-Recipe.
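To make the core idea concrete, below is a minimal, hypothetical sketch of BEV-level feature guidance in the spirit of the abstract: a frozen LiDAR BEV map supervises the camera BEV map during pretraining, with an optional foreground mask standing in for the structural hints from point clouds. All names and shapes are illustrative assumptions, not the released GAPretrain code (see the repository linked above).
```python
# Hypothetical sketch: LiDAR-to-camera BEV feature guidance for pretraining.
# Module/parameter names are assumptions for illustration only.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVGuidanceLoss(nn.Module):
    """Align camera BEV features with a frozen LiDAR BEV teacher."""

    def __init__(self, cam_channels: int, lidar_channels: int):
        super().__init__()
        # 1x1 conv projects camera features into the teacher's channel space.
        self.proj = nn.Conv2d(cam_channels, lidar_channels, kernel_size=1)

    def forward(
        self,
        cam_bev: torch.Tensor,    # (B, C_cam, H, W) after the view transform
        lidar_bev: torch.Tensor,  # (B, C_lid, H, W) from a frozen LiDAR model
        fg_mask: Optional[torch.Tensor] = None,  # (B, 1, H, W) structural weighting
    ) -> torch.Tensor:
        student = self.proj(cam_bev)
        loss = F.mse_loss(student, lidar_bev.detach(), reduction="none")
        if fg_mask is not None:
            loss = loss * fg_mask  # emphasize object/structure regions
        return loss.mean()

# During pretraining this term is added to the detector's objective;
# the LiDAR branch is discarded before fine-tuning and deployment.
criterion = BEVGuidanceLoss(cam_channels=64, lidar_channels=128)
cam_bev, lidar_bev = torch.randn(2, 64, 50, 50), torch.randn(2, 128, 50, 50)
print(criterion(cam_bev, lidar_bev))
```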
Related papers
- LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training [61.26381389532653]
LiOn-XA is an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation.
Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-10-21T09:50:17Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Self-supervised Learning of LiDAR 3D Point Clouds via 2D-3D Neural Calibration [107.61458720202984]
This paper introduces a novel self-supervised learning framework for enhancing 3D perception in autonomous driving scenes.
We propose the learnable transformation alignment to bridge the domain gap between image and point cloud data.
We establish dense 2D-3D correspondences to estimate the rigid pose (a closed-form version of the pose step is sketched below).
arXiv Detail & Related papers (2024-01-23T02:41:06Z)
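As a hedged illustration of the rigid-pose step mentioned above: once correspondences are lifted to 3D-3D pairs, the optimal rotation and translation have a closed form via the Kabsch algorithm. This is a standard building block, not the paper's full learnable alignment.
```python
# Standard Kabsch solver: closed-form rigid pose from point correspondences.
# This is a generic building block, not the paper's learnable alignment.
import numpy as np

def kabsch(src: np.ndarray, dst: np.ndarray):
    """Return R, t such that dst ~ R @ src + t (src, dst: (N, 3))."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

# Demo with a known pose: recover a yaw rotation and a translation.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, -1.0, 2.0])
R, t = kabsch(src, src @ R_true.T + t_true)
print(np.allclose(R, R_true), np.allclose(t, t_true))  # True True
```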
- Towards Generalizable Multi-Camera 3D Object Detection via Perspective Debiasing [28.874014617259935]
Multi-Camera 3D Object Detection (MC3D-Det) has gained prominence with the advent of bird's-eye view (BEV) approaches.
We propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections.
arXiv Detail & Related papers (2023-10-17T15:31:28Z)
- Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View [44.78243406441798]
This paper focuses on leveraging geometry information, such as depth, to model such feature transformation.
We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view.
We then aggregate the 3D feature volume into the BEV frame based on the 3D space occupancy derived from depth (a toy version of this lifting is sketched below).
arXiv Detail & Related papers (2023-07-09T06:07:22Z)
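A toy, assumption-laden sketch of the lifting step described in this entry: per-pixel categorical depth distributions spread image features along camera rays, after which the frustum features would be aggregated into BEV cells (here approximated by a simple pooling).
```python
# Toy version of depth-distribution feature lifting (shapes are illustrative).
import torch

B, C, H, W, D = 1, 64, 16, 44, 32  # batch, channels, height, width, depth bins
feat = torch.randn(B, C, H, W)     # 2D image features
depth_prob = torch.randn(B, D, H, W).softmax(dim=1)  # per-pixel depth distribution

# Outer product: each (pixel, depth-bin) cell receives a depth-weighted feature.
frustum = depth_prob.unsqueeze(1) * feat.unsqueeze(2)  # (B, C, D, H, W)

# A full pipeline would scatter frustum cells into BEV using camera geometry
# and the occupancy weighting; summing over image height is a crude stand-in.
bev_plane = frustum.sum(dim=3)     # (B, C, D, W): depth-by-width plane
print(bev_plane.shape)             # torch.Size([1, 64, 32, 44])
```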
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch (illustrated below).
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
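The second observation can be demonstrated in a few lines: when the two branches exchange features, a loss on the image branch sends nonzero gradients into the point-cloud backbone. The modules below are placeholders, not UPIDet's architecture.
```python
# Placeholder modules demonstrating cross-branch gradient flow.
import torch
import torch.nn as nn

point_backbone = nn.Linear(16, 32)    # stand-in for a point-cloud encoder
image_backbone = nn.Linear(16, 32)    # stand-in for an image encoder
aux_head = nn.Linear(64, 2)           # stand-in for a 2D auxiliary-task head

pts, img = torch.randn(4, 16), torch.randn(4, 16)
fused = torch.cat([point_backbone(pts), image_backbone(img)], dim=1)
loss = aux_head(fused).pow(2).mean()  # dummy image-branch training objective
loss.backward()

# The point backbone receives nonzero gradients from the image-branch loss.
print(point_backbone.weight.grad.abs().sum() > 0)  # tensor(True)
```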
- SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations [85.38562724999898]
We propose a 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU.
Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module and an inter-modal feature interaction module (a minimal inter-modal contrastive loss is sketched below).
To the best of our knowledge, this is the first study to explore contrastive learning pre-training strategies for outdoor multi-modal datasets.
arXiv Detail & Related papers (2021-12-09T03:27:00Z)
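As a minimal stand-in for the inter-modal feature interaction module, the following is a generic symmetric InfoNCE loss over matched image/point embeddings; it illustrates the contrastive idea, not the authors' exact formulation.
```python
# Generic symmetric InfoNCE over matched image/point embeddings.
import torch
import torch.nn.functional as F

def inter_modal_infonce(img_emb: torch.Tensor, pts_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # img_emb, pts_emb: (N, C) embeddings of N matched image/point pairs.
    img = F.normalize(img_emb, dim=1)
    pts = F.normalize(pts_emb, dim=1)
    logits = img @ pts.t() / temperature  # (N, N) cosine-similarity matrix
    targets = torch.arange(img.size(0))   # positives lie on the diagonal
    # Average over both matching directions (image->point, point->image).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

print(inter_modal_infonce(torch.randn(8, 128), torch.randn(8, 128)))
```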
- Exploring intermediate representation for monocular vehicle pose estimation [38.85309013717312]
We present a new learning-based framework to recover vehicle pose in SO(3) from a single RGB image.
In contrast to previous works that map from local appearance to observation angles, we explore a progressive approach by extracting meaningful Intermediate Geometrical Representations (IGRs).
This approach features a deep model that transforms perceived intensities to IGRs, which are mapped to a 3D representation encoding object orientation in the camera coordinate system.
arXiv Detail & Related papers (2020-11-17T06:30:51Z)
- Monocular 3D Detection with Geometric Constraints Embedding and Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D object detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimensions, and orientation, and then combine these estimates with perspective geometry constraints to compute the position attribute (see the sketch below).
arXiv Detail & Related papers (2020-09-02T00:51:51Z)
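The keypoints-plus-perspective-constraints step admits a simple closed form: given camera intrinsics, estimated dimensions and orientation (hence rotated 3D corner offsets), and the predicted 2D corner keypoints, the object translation solves a linear least-squares system. The demo values below are made up, and the solver is a generic reconstruction, not KM3D-Net's exact head.
```python
# Generic reconstruction: translation from corner keypoints + perspective.
import numpy as np

def solve_translation(corners, keypoints, fx, fy, cx, cy):
    """corners: (8, 3) rotated corner offsets (object-centered, camera axes);
    keypoints: (8, 2) predicted pixel positions of those corners.
    Solves the linearized projection constraints for t by least squares."""
    A, b = [], []
    for (x, y, z), (u, v) in zip(corners, keypoints):
        # u = fx*(x+tx)/(z+tz) + cx  =>  fx*tx + (cx-u)*tz = (u-cx)*z - fx*x
        A.append([fx, 0.0, cx - u]); b.append((u - cx) * z - fx * x)
        A.append([0.0, fy, cy - v]); b.append((v - cy) * z - fy * y)
    t, *_ = np.linalg.lstsq(np.asarray(A), np.asarray(b), rcond=None)
    return t  # (tx, ty, tz): object center in camera coordinates

# Demo: a car-sized, axis-aligned box at a known position (made-up numbers).
t_true = np.array([1.0, 0.5, 10.0])
length, height, width = 4.0, 1.5, 1.8
corners = np.array([[sx * length / 2, sy * height / 2, sz * width / 2]
                    for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
fx = fy = 700.0
cx, cy = 640.0, 360.0
pts = corners + t_true
kps = np.stack([fx * pts[:, 0] / pts[:, 2] + cx,
                fy * pts[:, 1] / pts[:, 2] + cy], axis=1)
print(solve_translation(corners, kps, fx, fy, cx, cy))  # ~ [1.0, 0.5, 10.0]
```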
This list is automatically generated from the titles and abstracts of the papers on this site.