Weakly Supervised 3D Open-vocabulary Segmentation
- URL: http://arxiv.org/abs/2305.14093v4
- Date: Tue, 9 Jan 2024 17:09:47 GMT
- Title: Weakly Supervised 3D Open-vocabulary Segmentation
- Authors: Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu,
Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
- Abstract summary: We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF)
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
- Score: 104.07740741126119
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-vocabulary segmentation of 3D scenes is a fundamental function of human
perception and thus a crucial objective in computer vision research. However,
this task is heavily impeded by the lack of large-scale and diverse 3D
open-vocabulary segmentation datasets for training robust and generalizable
models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation
models helps, but it compromises the open-vocabulary capability, as the 2D models
are mostly fine-tuned on closed-vocabulary datasets. We tackle the challenges
in 3D open-vocabulary segmentation by exploiting pre-trained foundation models
CLIP and DINO in a weakly supervised manner. Specifically, given only the
open-vocabulary text descriptions of the objects in a scene, we distill the
open-vocabulary multimodal knowledge and object reasoning capability of CLIP
and DINO into a neural radiance field (NeRF), which effectively lifts 2D
features into view-consistent 3D segmentation. A notable aspect of our approach
is that it does not require any manual segmentation annotations for either the
foundation models or the distillation process. Extensive experiments show that
our method even outperforms fully supervised models trained with segmentation
annotations in certain scenes, suggesting that 3D open-vocabulary segmentation
can be effectively learned from 2D images and text-image pairs. Code is
available at https://github.com/Kunhao-Liu/3D-OVS.
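To make the distillation idea above concrete, here is a minimal, self-contained sketch of lifting per-pixel 2D CLIP features into a NeRF-style 3D feature field. The names (FeatureField, render_features) and the toy tensors are illustrative assumptions, not the authors' 3D-OVS implementation; in the actual method, the compositing weights come from the NeRF density branch and the targets from CLIP/DINO feature maps of the training views.

```python
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """Maps 3D points to CLIP-dimensional features (density branch omitted)."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        return self.mlp(xyz)

def render_features(field: FeatureField, points: torch.Tensor,
                    weights: torch.Tensor) -> torch.Tensor:
    """Alpha-composites per-sample features along each ray.
    points: (rays, samples, 3); weights: (rays, samples) from the density field."""
    feats = field(points)                              # (rays, samples, D)
    return (weights.unsqueeze(-1) * feats).sum(dim=1)  # (rays, D)

# Toy training step: match rendered ray features to the 2D features of the
# corresponding pixels. Random tensors stand in for real ray samples,
# rendering weights, and CLIP/DINO feature maps.
field = FeatureField()
optimizer = torch.optim.Adam(field.parameters(), lr=1e-3)

points = torch.randn(1024, 64, 3)                       # ray sample positions
weights = torch.softmax(torch.randn(1024, 64), dim=-1)  # compositing weights
clip_targets = torch.randn(1024, 512)                   # per-pixel 2D features

rendered = render_features(field, points, weights)
loss = 1 - nn.functional.cosine_similarity(rendered, clip_targets, dim=-1).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At query time, the per-point or rendered features are compared against CLIP text embeddings of the class prompts; a sketch of that step follows the related-papers list below.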
Related papers
- Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation [92.17176311351469]
We tackle open-vocabulary 3D scene understanding by introducing a novel data generation pipeline and training framework.
Our method addresses three critical requirements for effective training: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale.
Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of over 30K annotated scenes with 5.6M mask-text pairs.
arXiv Detail & Related papers (2025-02-04T18:18:50Z)
- Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
We introduce Search3D, an approach to construct hierarchical open-vocabulary 3D scene representations.
Unlike prior methods, Search3D shifts towards a more flexible open-vocabulary 3D search paradigm.
For systematic evaluation, we contribute a scene-scale open-vocabulary 3D part segmentation benchmark based on MultiScan.
arXiv Detail & Related papers (2024-09-27T03:44:07Z)
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- Label-Efficient 3D Brain Segmentation via Complementary 2D Diffusion Models with Orthogonal Views [10.944692719150071]
We propose a novel 3D brain segmentation approach using complementary 2D diffusion models.
Our goal is to achieve reliable segmentation quality without requiring complete labels for each individual subject.
arXiv Detail & Related papers (2024-07-17T06:14:53Z)
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space (a loose sketch follows this entry).
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z)
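The DMA summary above is a single sentence, so the following is only a loose sketch of what densely co-embedding modalities into a common space can look like: hypothetical projection heads plus a symmetric InfoNCE-style alignment loss. None of the names or choices here come from the DMA paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical projection heads mapping each modality into a shared space.
point_proj = nn.Linear(64, 256)   # dense 3D point features -> common space
text_proj = nn.Linear(512, 256)   # text features -> common space

points = torch.randn(4096, 64)    # per-point features (placeholder)
texts = torch.randn(4096, 512)    # matching text features (placeholder)

z_p = F.normalize(point_proj(points), dim=-1)
z_t = F.normalize(text_proj(texts), dim=-1)

# Symmetric InfoNCE-style loss: matched point/text pairs attract, others repel.
logits = z_p @ z_t.T / 0.07
targets = torch.arange(logits.size(0))
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
loss.backward()
```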
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields [50.68719394443926]
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection [40.965892255504144]
FM-OV3D is a foundation-model-based cross-modal knowledge blending method for open-vocabulary 3D detection.
We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP.
Experiments show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model.
arXiv Detail & Related papers (2023-12-22T06:34:23Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets.
arXiv Detail & Related papers (2023-09-11T13:41:27Z)
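Several papers above (including the main paper, GOV-NeSF, and Panoptic Vision-Language Feature Fields) share a common query pattern: compare distilled 3D features against CLIP text embeddings and take the best match. Below is a minimal sketch of that query step, assuming OpenAI's clip package; the class prompts and the point-feature tensor are hypothetical placeholders for features queried from a distilled 3D field.

```python
import torch
import clip  # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical class prompts; any open-vocabulary descriptions work.
class_prompts = ["a photo of a chair", "a photo of a table", "a photo of a plant"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(class_prompts).to(device)).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # (C, 512)

# Placeholder for features queried from a distilled 3D field (N points).
point_feats = torch.randn(10_000, 512, device=device)
point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)

logits = point_feats @ text_emb.T  # (N, C) cosine similarities
labels = logits.argmax(dim=-1)     # per-point open-vocabulary labels
print(labels.shape)
```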
This list is automatically generated from the titles and abstracts of the papers on this site.