MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with
Informative-Preserved Reconstruction and Self-Distilled Consistency
- URL: http://arxiv.org/abs/2212.09948v2
- Date: Fri, 9 Jun 2023 11:59:32 GMT
- Title: MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with
Informative-Preserved Reconstruction and Self-Distilled Consistency
- Authors: Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han,
Yu Qiao
- Abstract summary: We propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points.
Our method concentrates on modeling regional geometry and incurs less ambiguity in masked reconstruction.
Combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas yields a unified framework, MM-3DScene.
- Score: 120.9499803967496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Masked Modeling (MM) has demonstrated widespread success in various vision
challenges, by reconstructing masked visual patches. Yet, applying MM for
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To this end, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction scheme, our method
can concentrate on modeling regional geometry and incurs less ambiguity in
masked reconstruction. Moreover, scenes under progressive masking ratios can
also serve to self-distill their intrinsic spatial consistency, requiring the
model to learn consistent representations from unmasked areas. By elegantly
combining informative-preserved reconstruction on masked areas and consistency
self-distillation from unmasked areas, a unified framework called MM-3DScene is
yielded. We conduct comprehensive experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
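The heart of the pretext task is deciding which points to preserve and which to mask. Below is a minimal NumPy sketch of that idea, assuming a surface-variation statistic as the "local statistics" and a linearly growing masking ratio for the progressive scheme; the paper's actual criteria, ratios, and schedule may differ.

```python
import numpy as np

def local_informativeness(points, k=16):
    """Score each point by local surface variation (one plausible choice of
    'local statistics'; the paper's exact statistic is an assumption here)."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]            # k nearest neighbours
    neigh = points[nn] - points[nn].mean(1, keepdims=True)
    cov = np.einsum('nki,nkj->nij', neigh, neigh) / k  # local covariance
    eig = np.linalg.eigvalsh(cov)                      # ascending eigenvalues
    return eig[:, 0] / (eig.sum(1) + 1e-9)             # high = structured/curved

def informative_preserved_mask(points, mask_ratio, keep_ratio=0.1, k=16):
    """Mask `mask_ratio` of the points while always keeping the
    `keep_ratio` most informative ones visible."""
    n = len(points)
    keep = np.argsort(local_informativeness(points, k))[::-1][:int(keep_ratio * n)]
    candidates = np.setdiff1d(np.arange(n), keep)
    masked = np.random.choice(candidates, int(mask_ratio * n), replace=False)
    return np.setdiff1d(np.arange(n), masked), masked  # visible, masked

# Progressive masking: the ratio grows as training proceeds.
points = np.random.rand(1024, 3).astype(np.float32)
for ratio in np.linspace(0.3, 0.7, 5):
    visible, masked = informative_preserved_mask(points, ratio)
```

In this reading, the preserved high-variation points anchor the masked reconstruction, while the progressively harder masking ratios supply the pairs used for consistency self-distillation on the unmasked areas.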
Related papers
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- Gaga: Group Any Gaussians via 3D-aware Memory Bank [66.54280093684427]
Gaga reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot segmentation models.
By eliminating the assumption of continuous view changes in training images, Gaga demonstrates robustness to variations in camera poses.
Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications.
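The memory-bank idea can be illustrated with a toy version: lift each 2D mask to 3D and link it to an existing global group when their 3D footprints agree. The centroid-and-radius matching below is a deliberate simplification of Gaga's 3D-aware association, and all names are hypothetical.

```python
import numpy as np

class GroupMemoryBank:
    """Toy 3D-aware memory bank: a 2D mask (lifted to 3D points) joins an
    existing global group if its centroid lies near a stored one, otherwise
    it starts a new group. Gaga's real association is richer; this centroid
    test is an illustrative assumption."""
    def __init__(self, radius=0.2):
        self.centroids = []  # one running 3D centroid per global group
        self.radius = radius

    def assign(self, lifted_points):          # (M, 3) points of one 2D mask
        c = lifted_points.mean(axis=0)
        if self.centroids:
            d = np.linalg.norm(np.stack(self.centroids) - c, axis=1)
            j = int(d.argmin())
            if d[j] < self.radius:            # merge into the matched group
                self.centroids[j] = 0.5 * (self.centroids[j] + c)
                return j
        self.centroids.append(c)              # inconsistent 2D ID -> new group
        return len(self.centroids) - 1

bank = GroupMemoryBank()
group_id = bank.assign(np.random.rand(50, 3))  # same object seen in any view
```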
arXiv Detail & Related papers (2024-04-11T17:57:19Z)
- Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints [0.6749750044497732]
We propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors.
Experiments on synthetic and real-world datasets show the capability of reducing the interference from prior estimation errors.
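One way to read the color decoupling is as a static head plus a view-dependent residual, with a consistency term tying the static color of a surface point across views. The sketch below follows that reading; the heads and the loss are hypothetical stand-ins for the paper's networks and constraints.

```python
import numpy as np

def decoupled_color(x, view_dir, f_static, f_view):
    """Color = view-independent term + view-dependent residual.
    `f_static` and `f_view` stand in for the paper's learned heads."""
    return np.clip(f_static(x) + f_view(x, view_dir), 0.0, 1.0)

def static_consistency_loss(c_static_a, c_static_b):
    """A consistency constraint in spirit: the view-independent color of the
    same surface point should agree between two views."""
    return float(((c_static_a - c_static_b) ** 2).mean())

c = decoupled_color(np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.0, 1.0]),
                    f_static=lambda x: np.full(3, 0.5),
                    f_view=lambda x, d: 0.1 * d)
```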
arXiv Detail & Related papers (2023-09-18T13:05:23Z)
- MaskRenderer: 3D-Infused Multi-Mask Realistic Face Reenactment [0.7673339435080445]
We present a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that can generate realistic, high-fidelity frames in real time.
arXiv Detail & Related papers (2023-09-10T17:41:46Z)
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), a semantics-based masked modeling approach.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding [106.0876425365599]
Masked Shape Prediction (MSP) is a new framework to conduct masked signal modeling in 3D scenes.
MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points.
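A natural way to supervise shape at masked points is a Chamfer distance between a predicted local patch and the ground-truth patch; the summary does not fix the exact loss, so the choice below is an assumption.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def masked_shape_loss(pred_patches, gt_patches):
    """Average shape error over local patches predicted at masked points."""
    return float(np.mean([chamfer(p, g)
                          for p, g in zip(pred_patches, gt_patches)]))

loss = masked_shape_loss([np.random.rand(32, 3)], [np.random.rand(32, 3)])
```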
arXiv Detail & Related papers (2023-05-08T20:09:19Z)
- GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) remain challenging to apply to large-scale 3D point clouds.
We propose a Generative Decoder for MAE (GD-MAE) that automatically merges the surrounding context.
We demonstrate the efficacy of the proposed method on large-scale benchmarks such as KITTI and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z)
- Layered Depth Refinement with Mask Guidance [61.10654666344419]
We formulate a novel problem of mask-guided depth refinement that utilizes a generic mask to refine the depth prediction of single image depth estimation (SIDE) models.
Our framework performs layered refinement and inpainting/outpainting, decomposing the depth map into two separate layers signified by the mask and the inverse mask.
We empirically show that our method is robust to different types of masks and initial depth predictions, accurately refining depth values in inner and outer mask boundary regions.
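The layered decomposition itself is easy to sketch: split the depth map with the mask and its inverse, complete each layer, then recompose along the mask boundary. The mean-fill completion below is a toy stand-in for the paper's learned inpainting/outpainting.

```python
import numpy as np

def mean_fill(layer):
    """Toy completion: fill missing values with the layer mean (a learned
    inpainting/outpainting network would go here)."""
    out = layer.copy()
    out[np.isnan(out)] = np.nanmean(layer)
    return out

def layered_refine(depth, mask, complete=mean_fill):
    """Decompose depth into mask / inverse-mask layers, complete each layer
    separately, then recombine along the mask edge."""
    fg = complete(np.where(mask, depth, np.nan))   # layer under the mask
    bg = complete(np.where(mask, np.nan, depth))   # layer under its inverse
    return np.where(mask, fg, bg)

depth = np.random.rand(64, 64).astype(np.float32)
mask = np.zeros_like(depth, dtype=bool)
mask[16:48, 16:48] = True
refined = layered_refine(depth, mask)
```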
arXiv Detail & Related papers (2022-06-07T06:42:44Z)
- Topologically Consistent Multi-View Face Inference Using Volumetric Sampling [25.001398662643986]
ToFu is a geometry inference framework that can produce topologically consistent meshes across identities and expressions.
A novel progressive mesh generation network embeds the topological structure of the face in a feature volume.
These high-quality assets are readily usable by production studios for avatar creation, animation and physically-based skin rendering.
arXiv Detail & Related papers (2021-10-06T17:55:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.