MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with
Informative-Preserved Reconstruction and Self-Distilled Consistency
- URL: http://arxiv.org/abs/2212.09948v2
- Date: Fri, 9 Jun 2023 11:59:32 GMT
- Title: MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with
Informative-Preserved Reconstruction and Self-Distilled Consistency
- Authors: Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han,
Yu Qiao
- Abstract summary: We propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points.
Our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction.
By combining informative-preserved reconstruction on masked areas with consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene.
- Score: 120.9499803967496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Masked Modeling (MM) has demonstrated widespread success in various vision
challenges, by reconstructing masked visual patches. Yet, applying MM for
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To this end, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction manner, our
method can concentrate on modeling regional geometry and suffers less
ambiguity during masked reconstruction. Moreover, scenes under progressive
masking ratios can also serve to self-distill their intrinsic spatial
consistency, requiring the model to learn consistent representations from
unmasked areas. By elegantly combining informative-preserved reconstruction
on masked areas with consistency self-distillation from unmasked areas, we
obtain a unified framework called MM-3DScene. We conduct comprehensive
experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
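The abstract describes two ingredients: a masking step that discovers informative points from local statistics and preserves them while the masking ratio grows progressively, and a self-distilled consistency objective on the points that stay unmasked. Below is a minimal PyTorch sketch of how these pieces could fit together; the k-NN distance-variance score, the linear ratio schedule, and the cosine consistency loss are all illustrative assumptions, not the authors' actual implementation.

```python
import torch

def informative_scores(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Score each point by a local statistic (here: the variance of its
    k-nearest-neighbour distances); higher variance stands in for the
    'representative structured points' the abstract describes."""
    d = torch.cdist(points, points)               # (N, N) pairwise distances
    knn, _ = d.topk(k + 1, largest=False)         # includes self at distance 0
    return knn[:, 1:].var(dim=1)

def progressive_mask(points: torch.Tensor, step: int, total_steps: int,
                     base_ratio: float = 0.3, max_ratio: float = 0.8) -> torch.Tensor:
    """Mask the least informative points first, with a masking ratio that
    grows over training (the 'progressive reconstruction manner')."""
    ratio = base_ratio + (max_ratio - base_ratio) * step / max(total_steps - 1, 1)
    n_mask = int(ratio * points.shape[0])
    order = informative_scores(points).argsort()  # ascending: least informative first
    mask = torch.zeros(points.shape[0], dtype=torch.bool)
    mask[order[:n_mask]] = True                   # True = masked (to be reconstructed)
    return mask

def consistency_loss(student_feats: torch.Tensor,
                     teacher_feats: torch.Tensor) -> torch.Tensor:
    """Self-distilled consistency on unmasked areas: pull the features of
    points that stay visible under a heavier mask towards the detached
    features of the same points under a lighter mask."""
    sim = torch.nn.functional.cosine_similarity(
        student_feats, teacher_feats.detach(), dim=-1)
    return (1.0 - sim).mean()

# Toy usage on a random 1024-point scene at training step 50 of 100.
pts = torch.randn(1024, 3)
mask = progressive_mask(pts, step=50, total_steps=100)
visible = pts[~mask]   # fed to the encoder; masked points are reconstructed
```

In this reading, points whose neighbourhoods carry distinctive structure stay visible, so reconstruction of the masked remainder is better constrained, which is one way to understand the abstract's "less ambiguity" claim.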
Related papers
- XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation [72.12250272218792]
We propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D.
We integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks.
The generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings.
arXiv Detail & Related papers (2024-11-20T12:02:12Z)
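The XMask3D entry above pairs generated 2D masks with 3D features at the mask level. A hedged sketch of what such an alignment could look like is a contrastive loss between mask-pooled 3D features and the text embeddings of their categories; the pooling and the InfoNCE-style formulation here are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def mask_text_alignment_loss(point_feats: torch.Tensor,
                             masks: torch.Tensor,
                             text_embeds: torch.Tensor,
                             tau: float = 0.07) -> torch.Tensor:
    """point_feats: (N, D) per-point 3D features; masks: (M, N) boolean,
    one generated mask per category; text_embeds: (M, D) text embeddings."""
    # Mask-level pooling: average the 3D features each mask covers.
    pooled = torch.stack([point_feats[m].mean(dim=0) for m in masks])
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = pooled @ text.t() / tau       # (M, M) similarity matrix
    target = torch.arange(masks.shape[0])  # i-th mask matches i-th text
    return F.cross_entropy(logits, target)
```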
- Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields.
LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation.
It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- Improving Neural Indoor Surface Reconstruction with Mask-Guided Adaptive Consistency Constraints [0.6749750044497732]
We propose a two-stage training process, decouple view-dependent and view-independent colors, and leverage two novel consistency constraints to enhance detail reconstruction performance without requiring extra priors.
Experiments on synthetic and real-world datasets show that the method reduces interference from prior estimation errors (a toy sketch of the colour decoupling follows this entry).
arXiv Detail & Related papers (2023-09-18T13:05:23Z)
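One way to read the colour decoupling in the indoor surface reconstruction entry above is as two heads: one that predicts a view-independent base colour from the point feature alone, and one that adds a view-dependent residual conditioned on the viewing direction. The module below is a minimal sketch under that assumption; the layer sizes and names are invented for illustration.

```python
import torch
import torch.nn as nn

class DecoupledColorHead(nn.Module):
    """Predicts colour as a view-independent base plus a view-dependent residual."""

    def __init__(self, feat_dim: int = 64, dir_dim: int = 3):
        super().__init__()
        # View-independent branch: sees only the per-point feature.
        self.static = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        # View-dependent branch: additionally conditioned on view direction.
        self.dynamic = nn.Sequential(
            nn.Linear(feat_dim + dir_dim, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, feat: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        base = self.static(feat)
        residual = self.dynamic(torch.cat([feat, view_dir], dim=-1))
        return torch.sigmoid(base + residual)  # colour in [0, 1]
```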
- MaskRenderer: 3D-Infused Multi-Mask Realistic Face Reenactment [0.7673339435080445]
We present a novel end-to-end identity-agnostic face reenactment system, MaskRenderer, that can generate realistic, high-fidelity frames in real time.
arXiv Detail & Related papers (2023-09-10T17:41:46Z)
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), built on semantic-based masked modeling.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding [106.0876425365599]
Masked Shape Prediction (MSP) is a new framework to conduct masked signal modeling in 3D scenes.
MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points.
arXiv Detail & Related papers (2023-05-08T20:09:19Z)
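For the MSP entry above, a concrete instance of using geometric shape as the prediction target for masked points is to regress locally estimated surface normals. The PCA-based target below is one assumption about what the shape cue could be, not the paper's definition.

```python
import numpy as np

def normal_targets(points: np.ndarray, masked_idx: np.ndarray, k: int = 16) -> np.ndarray:
    """points: (N, 3) scene points; returns a unit normal per masked point,
    estimated by PCA over each point's k-nearest neighbourhood."""
    targets = []
    for i in masked_idx:
        dists = np.linalg.norm(points - points[i], axis=1)
        nbrs = points[np.argsort(dists)[:k]]         # k nearest neighbours (incl. self)
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)   # 3x3 local covariance
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
        targets.append(eigvecs[:, 0])                # smallest-variance axis ~ normal
    return np.stack(targets)
```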
- GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds [72.60362979456035]
Masked Autoencoders (MAE) remain challenging to apply to large-scale 3D point clouds.
We propose a Generative Decoder for MAE (GD-MAE) that automatically merges the surrounding context.
We demonstrate the efficacy of the proposed method on large-scale benchmarks including KITTI and ONCE.
arXiv Detail & Related papers (2022-12-06T14:32:55Z)
- Layered Depth Refinement with Mask Guidance [61.10654666344419]
We formulate a novel problem of mask-guided depth refinement that utilizes a generic mask to refine the depth prediction of SIDE models.
Our framework performs layered refinement and inpainting/outpainting, decomposing the depth map into two separate layers signified by the mask and the inverse mask.
We empirically show that our method is robust to different types of masks and initial depth predictions, accurately refining depth values in inner and outer mask boundary regions.
arXiv Detail & Related papers (2022-06-07T06:42:44Z)
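The layered refinement entry above decomposes the depth map into a mask layer and an inverse-mask layer, refines each, and recomposes. The sketch below illustrates the decomposition; the median filter is a deliberately crude stand-in for the paper's learned inpainting/outpainting of each layer, and it requires SciPy.

```python
import numpy as np
from scipy.ndimage import median_filter

def layered_refine(depth: np.ndarray, mask: np.ndarray, size: int = 5) -> np.ndarray:
    """depth: (H, W) initial prediction from a SIDE model; mask: (H, W) boolean."""
    fill = float(depth.mean())
    inner = np.where(mask, depth, fill)            # layer signified by the mask
    outer = np.where(~mask, depth, fill)           # layer from the inverse mask
    # Stand-in for learned per-layer inpainting/outpainting of each side.
    inner_ref = median_filter(inner, size=size)
    outer_ref = median_filter(outer, size=size)
    return np.where(mask, inner_ref, outer_ref)    # recompose along the mask boundary
```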
- Topologically Consistent Multi-View Face Inference Using Volumetric Sampling [25.001398662643986]
ToFu is a geometry inference framework that can produce topologically consistent meshes across identities and expressions.
A novel progressive mesh generation network embeds the topological structure of the face in a feature volume.
These high-quality assets are readily usable by production studios for avatar creation, animation and physically-based skin rendering.
arXiv Detail & Related papers (2021-10-06T17:55:08Z)