Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding
- URL: http://arxiv.org/abs/2305.05026v1
- Date: Mon, 8 May 2023 20:09:19 GMT
- Title: Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding
- Authors: Li Jiang, Zetong Yang, Shaoshuai Shi, Vladislav Golyanik, Dengxin Dai, Bernt Schiele
- Abstract summary: Masked Shape Prediction (MSP) is a new framework to conduct masked signal modeling in 3D scenes.
MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points.
- Score: 106.0876425365599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked signal modeling has greatly advanced self-supervised pre-training for
language and 2D images. However, it is still not fully explored in 3D scene
understanding. Thus, this paper introduces Masked Shape Prediction (MSP), a new
framework to conduct masked signal modeling in 3D scenes. MSP uses the
essential 3D semantic cue, i.e., geometric shape, as the prediction target for
masked points. A context-enhanced shape target, consisting of an explicit shape
context and an implicit deep shape feature, is proposed to facilitate the
exploitation of contextual cues in shape prediction. Meanwhile, the pre-training architecture
in MSP is carefully designed to alleviate the masked shape leakage from point
coordinates. Experiments on multiple 3D understanding tasks on both indoor and
outdoor datasets demonstrate the effectiveness of MSP in learning good feature
representations to consistently boost downstream performance.
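To make the recipe concrete, below is a minimal NumPy sketch of one plausible way to build masked shape prediction targets: points are randomly masked, and a simplified shape-context histogram around each masked point serves as the explicit shape target. The function names, mask ratio, and binning scheme are illustrative assumptions rather than the paper's exact formulation; the implicit deep shape feature and the leakage-aware pre-training architecture are not modeled here.

```python
# Toy sketch of masked-point shape targets (assumed setup, not the paper's code).
import numpy as np

def mask_points(points, mask_ratio=0.4, rng=None):
    """Randomly split a point cloud (N, 3) into visible and masked indices."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = points.shape[0]
    n_mask = int(n * mask_ratio)
    perm = rng.permutation(n)
    return perm[n_mask:], perm[:n_mask]  # visible_idx, masked_idx

def shape_context_target(points, center_idx, k=16, radial_bins=4, polar_bins=4):
    """Explicit shape target for one masked point: a histogram of its k nearest
    neighbours over log-radial and polar-angle bins (a simplified shape context).
    Targets are computed from the full cloud; the encoder would only see visible points."""
    center = points[center_idx]
    offsets = points - center
    dists = np.linalg.norm(offsets, axis=1)
    nn = np.argsort(dists)[1:k + 1]                  # skip the point itself
    d, off = dists[nn], offsets[nn]
    r_edges = np.logspace(np.log10(d.min() + 1e-6), np.log10(d.max() + 1e-6),
                          radial_bins + 1)
    r_bin = np.clip(np.digitize(d, r_edges) - 1, 0, radial_bins - 1)
    polar = np.arccos(np.clip(off[:, 2] / (d + 1e-6), -1.0, 1.0))
    p_bin = np.clip((polar / np.pi * polar_bins).astype(int), 0, polar_bins - 1)
    hist = np.zeros((radial_bins, polar_bins))
    np.add.at(hist, (r_bin, p_bin), 1.0)
    return (hist / k).ravel()                        # normalised descriptor

pts = np.random.default_rng(1).normal(size=(2048, 3))
visible_idx, masked_idx = mask_points(pts)
targets = np.stack([shape_context_target(pts, i) for i in masked_idx[:8]])
print(targets.shape)  # (8, radial_bins * polar_bins)
```

In a full pre-training loop, an encoder would consume only the visible points and regress such targets (together with deep shape features) at the masked positions; the snippet above only illustrates the target-construction side.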
Related papers
- XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation [72.12250272218792]
We propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D.
We integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks.
The generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings.
arXiv Detail & Related papers (2024-11-20T12:02:12Z)
- Fast and Efficient: Mask Neural Fields for 3D Scene Segmentation [47.08813064337934]
This paper presents MaskField, which enables efficient 3D open-vocabulary segmentation with neural fields from a novel perspective.
MaskField decomposes the distillation of mask and semantic features from foundation models by formulating a mask feature field and queries.
Our experiments show that MaskField not only surpasses prior state-of-the-art methods but also achieves remarkably fast convergence.
arXiv Detail & Related papers (2024-07-01T12:07:26Z)
- MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency [120.9499803967496]
We propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points.
Our method can concentrate on modeling regional geometry and enjoy less ambiguity for masked reconstruction.
Combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas yields a unified framework called MM-3DScene.
arXiv Detail & Related papers (2022-12-20T01:53:40Z)
- 3DLatNav: Navigating Generative Latent Spaces for Semantic-Aware 3D Object Manipulation [2.8661021832561757]
3D generative models have been recently successful in generating realistic 3D objects in the form of point clouds.
However, most models offer no control over the shape semantics of component object parts without extensive semantic labels or other reference point clouds.
We propose 3DLatNav, a novel approach to navigating pretrained generative latent spaces that enables controlled part-level semantic manipulation of 3D objects.
arXiv Detail & Related papers (2022-11-17T18:47:56Z)
- 3D Shape Reconstruction from 2D Images with Disentangled Attribute Flow [61.62796058294777]
Reconstructing 3D shape from a single 2D image is a challenging task.
Most of the previous methods still struggle to extract semantic attributes for the 3D reconstruction task.
We propose 3DAttriFlow to disentangle and extract semantic attributes through different semantic levels in the input images.
arXiv Detail & Related papers (2022-03-29T02:03:31Z)
- MonoRUn: Monocular 3D Object Detection by Reconstruction and Uncertainty Propagation [4.202461384355329]
We propose MonoRUn, a novel 3D object detection framework that learns dense correspondences and geometry in a self-supervised manner.
Our proposed approach outperforms current state-of-the-art methods on the KITTI benchmark.
arXiv Detail & Related papers (2021-03-23T15:03:08Z)
- Cylinder3D: An Effective 3D Framework for Driving-scene LiDAR Semantic Segmentation [87.54570024320354]
State-of-the-art methods for large-scale driving-scene LiDAR semantic segmentation often project and process the point clouds in the 2D space.
A straightforward solution to tackle the issue of 3D-to-2D projection is to keep the 3D representation and process the points in the 3D space.
We develop a 3D cylinder partition and a 3D cylinder convolution based framework, termed Cylinder3D, which exploits the 3D topology relations and structures of driving-scene point clouds (a toy cylindrical-partition sketch is given after this list).
arXiv Detail & Related papers (2020-08-04T13:56:19Z)
- Implicit Mesh Reconstruction from Unannotated Image Collections [48.85604987196472]
We present an approach to infer the 3D shape, texture, and camera pose for an object from a single RGB image.
We represent the shape as an image-conditioned implicit function that transforms the surface of a sphere to that of the predicted mesh, while additionally predicting the corresponding texture.
arXiv Detail & Related papers (2020-07-16T17:55:20Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation.
We devise a new geometry-based strategy to embed depth information with a low-resolution voxel representation.
Our proposed geometric embedding works better than the depth feature learning used in conventional SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
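For the Cylinder3D entry above, here is a toy sketch of a cylindrical partition that maps Cartesian LiDAR points to (radius, azimuth, height) voxel indices; the bin counts and ranges are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a cylindrical partition for a LiDAR point cloud,
# in the spirit of Cylinder3D; all parameters below are assumed values.
import numpy as np

def cylindrical_voxel_indices(points, rho_bins=480, phi_bins=360, z_bins=32,
                              rho_max=50.0, z_range=(-3.0, 1.0)):
    """Map Cartesian points (N, 3) to (rho, phi, z) voxel indices."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)        # radial distance from the sensor
    phi = np.arctan2(y, x)                # azimuth angle in [-pi, pi)
    rho_idx = np.clip((rho / rho_max * rho_bins).astype(int), 0, rho_bins - 1)
    phi_idx = np.clip(((phi + np.pi) / (2 * np.pi) * phi_bins).astype(int),
                      0, phi_bins - 1)
    z_lo, z_hi = z_range
    z_idx = np.clip(((z - z_lo) / (z_hi - z_lo) * z_bins).astype(int),
                    0, z_bins - 1)
    return np.stack([rho_idx, phi_idx, z_idx], axis=1)

pts = np.random.default_rng(0).uniform([-50, -50, -3], [50, 50, 1], size=(10000, 3))
vox = cylindrical_voxel_indices(pts)
print(vox.shape, vox.max(axis=0))  # (10000, 3) and the max index per axis
```

Because the azimuthal arc length grows with radius, cells in this grid become larger farther from the sensor, which roughly matches how LiDAR point density falls off with distance.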
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.