Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors
- URL: http://arxiv.org/abs/2302.14746v1
- Date: Tue, 28 Feb 2023 16:45:21 GMT
- Title: Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors
- Authors: Ji Hou, Xiaoliang Dai, Zijian He, Angela Dai, Matthias Nießner
- Abstract summary: We propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training to embed 3D priors into 2D learned feature representations.
We demonstrate Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks.
- Score: 29.419069066603438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets, are trained to perceive the world from 2D images. However, to
more effectively understand 3D structural priors in 2D backbones, we propose
Mask3D to leverage existing large-scale RGB-D data in a self-supervised
pre-training to embed these 3D priors into 2D learned feature representations.
In contrast to traditional 3D contrastive learning paradigms requiring 3D
reconstructions or multi-view correspondences, our approach is simple: we
formulate a pretext reconstruction task by masking RGB and depth patches in
individual RGB-D frames. We demonstrate that Mask3D is particularly effective in
embedding 3D priors into the powerful 2D ViT backbone, enabling improved
representation learning for various scene understanding tasks, such as semantic
segmentation, instance segmentation and object detection. Experiments show that
Mask3D notably outperforms existing self-supervised 3D pre-training approaches
on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an
improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image
semantic segmentation.
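For illustration, the pretext task described above can be sketched as follows: mask a large fraction of the patches in an RGB-D frame, encode the visible RGB patches with a ViT-style encoder, and regress depth for the masked patches. This is a minimal sketch under assumptions of my own (module sizes, a 75% masking ratio, and a depth-only reconstruction target), not the authors' released Mask3D implementation.

```python
# Minimal PyTorch sketch of a masked RGB-D pretext task: mask most patches of an
# RGB-D frame, encode visible RGB patches with a small ViT-style encoder, and
# regress depth for the masked patches. Sizes and masking ratio are illustrative.
import torch
import torch.nn as nn

PATCH, DIM, HEADS, LAYERS = 16, 256, 4, 4

def patchify(img, patch=PATCH):
    """(B, C, H, W) -> (B, N, C*patch*patch) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, patch, patch).unfold(3, patch, patch)   # B, C, h, w, p, p
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)

class MaskedRGBDPretext(nn.Module):
    def __init__(self, img_size=224, mask_ratio=0.75):
        super().__init__()
        self.n_patches = (img_size // PATCH) ** 2
        self.mask_ratio = mask_ratio
        self.rgb_embed = nn.Linear(3 * PATCH * PATCH, DIM)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, DIM))
        layer = nn.TransformerEncoderLayer(DIM, HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, LAYERS)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        self.depth_head = nn.Linear(DIM, PATCH * PATCH)        # depth patch values

    def forward(self, rgb, depth):
        rgb_p, depth_p = patchify(rgb), patchify(depth)        # (B, N, *)
        B, N, _ = rgb_p.shape
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.rand(B, N, device=rgb.device).argsort(dim=1)
        keep, masked = perm[:, :n_keep], perm[:, n_keep:]

        tokens = self.rgb_embed(rgb_p) + self.pos              # (B, N, DIM)
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, DIM))
        encoded = self.encoder(visible)

        # Scatter encoded visible tokens back; masked positions get a learned token.
        full = self.mask_token.repeat(B, N, 1)
        full = full.scatter(1, keep.unsqueeze(-1).expand(-1, -1, DIM), encoded)
        pred = self.depth_head(full + self.pos)                # (B, N, PATCH*PATCH)

        # Reconstruction loss only on the masked patches.
        idx = masked.unsqueeze(-1).expand(-1, -1, PATCH * PATCH)
        loss = ((torch.gather(pred, 1, idx) - torch.gather(depth_p, 1, idx)) ** 2).mean()
        return loss

if __name__ == "__main__":
    model = MaskedRGBDPretext()
    rgb = torch.randn(2, 3, 224, 224)       # RGB frames
    depth = torch.randn(2, 1, 224, 224)     # depth maps (random values for the demo)
    loss = model(rgb, depth)
    loss.backward()                          # after pre-training, keep the encoder
```

After pre-training, only the encoder would be kept and fine-tuned on downstream 2D tasks such as semantic segmentation; the depth head exists solely to force 3D structure into the 2D features.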
Related papers
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [104.79392615848109]
We present Magic123, a two-stage coarse-to-fine approach for high-quality, textured 3D meshes from a single unposed image.
In the first stage, we optimize a neural radiance field to produce a coarse geometry.
In the second stage, we adopt a memory-efficient differentiable mesh representation to yield a high-resolution mesh with a visually appealing texture.
arXiv Detail & Related papers (2023-06-30T17:59:08Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance on semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
Through self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity.
arXiv Detail & Related papers (2022-12-13T17:59:20Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views, and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
- Pri3D: Can 3D Priors Help 2D Representation Learning? [37.35721274841419]
We introduce an approach to learn view-invariant, geometry-aware representations for network pre-training.
We employ contrastive learning under both multi-view image constraints and image-geometry constraints to encode 3D priors into learned 2D representations.
arXiv Detail & Related papers (2021-04-22T17:59:30Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Virtual Multi-view Fusion for 3D Semantic Segmentation [11.259694096475766]
We show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches.
When the 2D per-pixel predictions are aggregated on 3D surfaces, our virtual multi-view fusion method is able to achieve significantly better 3D semantic segmentation results.
arXiv Detail & Related papers (2020-07-26T14:46:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.