Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with
Foundation Models
- URL: http://arxiv.org/abs/2305.08776v3
- Date: Thu, 2 Nov 2023 15:21:51 GMT
- Title: Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with
Foundation Models
- Authors: Zhimin Chen, Longlong Jing, Yingwei Li, Bing Li
- Abstract summary: Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding.
Their potential to enrich 3D scene representation learning is largely untapped due to the existence of the domain gap.
We propose an innovative methodology called Bridge3D to address this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models.
- Score: 18.315856283440386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models have achieved remarkable results in 2D and language tasks
like image segmentation, object detection, and visual-language understanding.
However, their potential to enrich 3D scene representation learning is largely
untapped due to the existence of the domain gap. In this work, we propose an
innovative methodology called Bridge3D to address this gap by pre-training 3D
models using features, semantic masks, and captions sourced from foundation
models. Specifically, our method employs semantic masks from foundation models
to guide the masking and reconstruction process for the masked autoencoder,
enabling more focused attention on foreground representations. Moreover, we
bridge the 3D-text gap at the scene level using image captioning foundation
models, thereby facilitating scene-level knowledge distillation. We further
extend this bridging effort by introducing an innovative object-level knowledge
distillation method that harnesses highly accurate object-level masks and
semantic text data from foundation models. Our methodology significantly
surpasses the performance of existing state-of-the-art methods in 3D object
detection and semantic segmentation tasks. For instance, on the ScanNet
dataset, Bridge3D improves the baseline by a notable margin of 6.3%. Code will
be available at: https://github.com/Zhimin-C/Bridge3D
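The abstract states that semantic masks guide the masking step of the masked autoencoder so that more attention goes to foreground regions, but it does not say how the masked tokens are chosen. The following is a minimal Python sketch of one possible scheme, weighted sampling that masks foreground tokens more often than background ones. The function name and the `mask_ratio` and `fg_weight` parameters are illustrative assumptions and are not taken from the Bridge3D code.

```python
import random


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=0):
    """Pick token indices to mask, favoring foreground tokens.

    is_foreground: list of bools, one per point-cloud token. This is a
    hypothetical input, e.g. derived by projecting 2D semantic masks
    onto the 3D points.
    fg_weight: assumed relative sampling weight of a foreground token
    (background tokens have weight 1.0).
    """
    rng = random.Random(seed)
    num_mask = int(len(is_foreground) * mask_ratio)

    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw u in [0, 1), use key = u ** (1 / weight), keep the largest keys.
    keyed = []
    for index, fg in enumerate(is_foreground):
        weight = fg_weight if fg else 1.0
        keyed.append((rng.random() ** (1.0 / weight), index))
    keyed.sort(reverse=True)

    return sorted(index for _, index in keyed[:num_mask])


# Usage: 10 foreground tokens followed by 10 background tokens.
flags = [True] * 10 + [False] * 10
masked = foreground_guided_mask(flags, mask_ratio=0.5)
```

With `fg_weight` above 1.0, foreground tokens are masked (and therefore reconstructed) more often on average, which is one way to read "more focused attention on foreground representations"; the actual method may differ.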
Related papers
- MOSE: Monocular Semantic Reconstruction Using NeRF-Lifted Noisy Priors [11.118490283303407]
We propose a neural field semantic reconstruction approach to lift inferred image-level noisy priors to 3D.
Our method produces accurate semantics and geometry in both 3D and 2D space.
arXiv Detail & Related papers (2024-09-21T05:12:13Z)
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore the comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- PointSeg: A Training-Free Paradigm for 3D Scene Segmentation via Foundation Models [51.24979014650188]
We present PointSeg, a training-free paradigm that leverages off-the-shelf vision foundation models to address 3D scene perception tasks.
PointSeg can segment anything in a 3D scene by acquiring accurate 3D prompts to align their corresponding pixels across frames.
Our approach significantly surpasses the state-of-the-art specialist training-free model by 14.1%, 12.3%, and 12.6% mAP on ScanNet, ScanNet++, and KITTI-360 datasets.
arXiv Detail & Related papers (2024-03-11T03:28:20Z)
- FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding [11.118857208538039]
We present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS).
Results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by 10.2 percent on open-vocabulary language-based object detection.
This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments.
arXiv Detail & Related papers (2024-01-03T20:39:02Z)
- 3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models [102.75875255071246]
3D content creation via text-driven stylization has posed a fundamental challenge to the multimedia and graphics community.
We propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models.
arXiv Detail & Related papers (2023-11-09T15:51:27Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)