GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling
for Multi-view 3D Understanding
- URL: http://arxiv.org/abs/2303.11325v2
- Date: Mon, 28 Aug 2023 08:00:52 GMT
- Title: GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling
for Multi-view 3D Understanding
- Authors: Jihao Liu, Tai Wang, Boxiao Liu, Qihang Zhang, Yu Liu, Hongsheng Li
- Abstract summary: Multi-view camera-based 3D detection is a challenging problem in computer vision.
Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network.
We propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm.
- Score: 42.780417042750315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view camera-based 3D detection is a challenging problem in computer
vision. Recent works leverage a pretrained LiDAR detection model to transfer
knowledge to a camera-based student network. However, we argue that there is a
major domain gap between the LiDAR BEV features and the camera-based BEV
features, as they have different characteristics and are derived from different
sources. In this paper, we propose Geometry Enhanced Masked Image Modeling
(GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune
paradigm to improve multi-view camera-based 3D detection. GeoMIM is a
multi-camera vision transformer with Cross-View Attention (CVA) blocks that
uses LiDAR BEV features encoded by the pretrained BEV model as learning
targets. During pretraining, GeoMIM's decoder has a semantic branch that completes
dense perspective-view features and a geometry branch that reconstructs
dense perspective-view depth maps. The geometry branch is designed to be
camera-aware, taking the camera's parameters as input for better transfer
capability. Extensive results demonstrate that GeoMIM outperforms existing
methods on the nuScenes benchmark, achieving state-of-the-art performance for
camera-based 3D object detection and 3D segmentation. Code and pretrained
models are available at https://github.com/Sense-X/GeoMIM.
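The two-branch pretraining objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-view random masking scheme, the mean-squared-error feature loss, the mean-absolute-error depth loss, and the `depth_weight` parameter are all assumptions made for illustration, and the targets are toy numbers rather than real LiDAR-derived features.

```python
import random

def random_patch_mask(num_views, num_patches, mask_ratio, seed=0):
    """Return one boolean mask per camera view; True marks a masked patch.

    Hypothetical helper: the masking strategy used by GeoMIM is not
    specified in the abstract, so uniform random masking is assumed.
    """
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masks = []
    for _ in range(num_views):
        masked = set(rng.sample(range(num_patches), num_masked))
        masks.append([i in masked for i in range(num_patches)])
    return masks

def two_branch_loss(feat_pred, feat_target, depth_pred, depth_target,
                    depth_weight=1.0):
    """Combine a semantic (feature) loss with a geometry (depth) loss.

    Assumed losses: mean squared error on features and mean absolute
    error on depth. The paper may use different loss functions.
    """
    semantic = sum((p - t) ** 2
                   for p, t in zip(feat_pred, feat_target)) / len(feat_pred)
    geometry = sum(abs(p - t)
                   for p, t in zip(depth_pred, depth_target)) / len(depth_pred)
    return semantic + depth_weight * geometry

# Toy usage: 6 camera views, 16 patches per view, half of them masked.
masks = random_patch_mask(num_views=6, num_patches=16, mask_ratio=0.5)
loss = two_branch_loss([0.0, 1.0], [0.0, 0.0], [10.0, 20.0], [12.0, 20.0])
```

In a real pipeline the predictions would come from the decoder's two branches and the targets from the pretrained LiDAR model; the sketch only shows how a feature term and a depth term might be combined into one training signal.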
Related papers
- VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection [80.62052650370416]
Monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics.
In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
arXiv Detail & Related papers (2024-04-15T03:12:12Z) - Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z) - M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D
Object Detection [2.5158048364984564]
I proposed a network structure for multi-view 3D object detection using camera-only data and a Bird's-Eye-View map.
My work addresses a current key challenge: domain adaptation and visual data transfer.
My study uses 3D information as available semantic information and blends 2D multi-view image features into the visual-language transfer design.
arXiv Detail & Related papers (2023-11-02T04:28:51Z) - Towards Generalizable Multi-Camera 3D Object Detection via Perspective
Debiasing [28.874014617259935]
Multi-Camera 3D Object Detection (MC3D-Det) has gained prominence with the advent of bird's-eye view (BEV) approaches.
We propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections.
arXiv Detail & Related papers (2023-10-17T15:31:28Z) - DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal
Knowledge Distillation [25.933070263556374]
3D perception based on representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in the autonomous driving industry.
There exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection.
We propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector.
arXiv Detail & Related papers (2023-09-26T17:56:21Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z) - BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object
Detection [17.526914782562528]
3D object detection from multiple image views is a challenging task for visual scene understanding.
We propose BEVDistill, a cross-modal BEV knowledge distillation framework for multi-view 3D object detection.
Our best model achieves 59.4 NDS on the nuScenes test leaderboard, setting a new state of the art among image-based detectors.
arXiv Detail & Related papers (2022-11-17T07:26:14Z) - A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.