Weakly Supervised 3D Open-vocabulary Segmentation
- URL: http://arxiv.org/abs/2305.14093v4
- Date: Tue, 9 Jan 2024 17:09:47 GMT
- Title: Weakly Supervised 3D Open-vocabulary Segmentation
- Authors: Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu,
Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, Shijian Lu
- Abstract summary: We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
- Score: 104.07740741126119
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Open-vocabulary segmentation of 3D scenes is a fundamental function of human
perception and thus a crucial objective in computer vision research. However,
this task is heavily impeded by the lack of large-scale and diverse 3D
open-vocabulary segmentation datasets for training robust and generalizable
models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation
models helps, but it compromises the open-vocabulary capability, as the 2D models
are mostly fine-tuned on closed-vocabulary datasets. We tackle the challenges
in 3D open-vocabulary segmentation by exploiting pre-trained foundation models
CLIP and DINO in a weakly supervised manner. Specifically, given only the
open-vocabulary text descriptions of the objects in a scene, we distill the
open-vocabulary multimodal knowledge and object reasoning capability of CLIP
and DINO into a neural radiance field (NeRF), which effectively lifts 2D
features into view-consistent 3D segmentation. A notable aspect of our approach
is that it does not require any manual segmentation annotations for either the
foundation models or the distillation process. Extensive experiments show that
our method even outperforms fully supervised models trained with segmentation
annotations in certain scenes, suggesting that 3D open-vocabulary segmentation
can be effectively learned from 2D images and text-image pairs. Code is
available at https://github.com/Kunhao-Liu/3D-OVS.
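Reduced to its essentials, the mechanism is: render a per-pixel feature from the NeRF's feature field, pull it toward the frozen CLIP feature at that pixel with a distillation loss, and segment by matching rendered features against the CLIP text embeddings of the given class prompts. The sketch below is a minimal illustration under assumed tensor shapes; all names are hypothetical, it omits the paper's DINO-based regularization, and it is not the released 3D-OVS code.

```python
import torch
import torch.nn.functional as F

# Assumed, illustrative shapes (not the paper's code):
#   rendered_feats: (N, D) features rendered per pixel from the NeRF feature field
#   clip_feats:     (N, D) frozen CLIP image features for the same pixels
#   text_embeds:    (C, D) CLIP text embeddings of the given class prompts

def distillation_loss(rendered_feats: torch.Tensor,
                      clip_feats: torch.Tensor) -> torch.Tensor:
    """Pull NeRF-rendered features toward the frozen 2D CLIP features."""
    rendered = F.normalize(rendered_feats, dim=-1)
    target = F.normalize(clip_feats, dim=-1)
    # Cosine distance; the paper's DINO-based regularization is omitted here.
    return (1.0 - (rendered * target).sum(dim=-1)).mean()

def segment_by_text(rendered_feats: torch.Tensor,
                    text_embeds: torch.Tensor) -> torch.Tensor:
    """Label each pixel with its most similar text prompt."""
    feats = F.normalize(rendered_feats, dim=-1)   # (N, D)
    texts = F.normalize(text_embeds, dim=-1)      # (C, D)
    return (feats @ texts.t()).argmax(dim=-1)     # (N,) per-pixel class ids
```

Because the feature field is view-consistent by construction, the argmax over text similarities yields the same labels across rendered viewpoints, which is what lifts the 2D supervision into a 3D segmentation.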
Related papers
- Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions.
We introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation.
Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting.
arXiv Detail & Related papers (2024-09-27T03:44:07Z)
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- Label-Efficient 3D Brain Segmentation via Complementary 2D Diffusion Models with Orthogonal Views [10.944692719150071]
We propose a novel 3D brain segmentation approach using complementary 2D diffusion models.
Our goal is to achieve reliable segmentation quality without requiring complete labels for each individual subject.
arXiv Detail & Related papers (2024-07-17T06:14:53Z)
- Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding [39.55810156545949]
We propose a Dense Multimodal Alignment (DMA) framework to densely co-embed different modalities into a common space.
Our DMA method produces highly competitive open-vocabulary segmentation performance on various indoor and outdoor tasks.
arXiv Detail & Related papers (2024-07-13T05:39:17Z)
- GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields [50.68719394443926]
Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) is a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics.
GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation.
arXiv Detail & Related papers (2024-04-01T05:19:50Z)
- FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection [40.965892255504144]
FM-OV3D performs foundation-model-based cross-modal knowledge blending for open-vocabulary 3D detection.
We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP.
Experiments show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model.
arXiv Detail & Related papers (2023-12-22T06:34:23Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundation models to the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that combines the per-model results via voting (a minimal sketch follows this entry).
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
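One plausible reading of that voting step, as a minimal sketch: project each 2D model's mask predictions onto the 3D points, then take a per-point majority vote. The function and array names below are hypothetical illustrations, not the paper's code.

```python
import numpy as np

# Hypothetical inputs: preds is a list of per-point label arrays, one per
# 2D foundation model, each of shape (num_points,) with -1 for "no label".

def fuse_labels_by_voting(preds: list, num_classes: int) -> np.ndarray:
    votes = np.zeros((preds[0].shape[0], num_classes), dtype=np.int64)
    for p in preds:
        valid = p >= 0                        # skip points a model left unlabeled
        votes[np.flatnonzero(valid), p[valid]] += 1
    fused = votes.argmax(axis=1)              # majority class per point
    fused[votes.sum(axis=1) == 0] = -1        # no votes at all -> stays unlabeled
    return fused
```

A real pipeline would first project each model's 2D masks into 3D via the camera poses; only the vote itself is sketched here.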
- Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets.
arXiv Detail & Related papers (2023-09-11T13:41:27Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries (see the sketch below).
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
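An open-vocabulary query against such co-embedded point features reduces to a similarity test in CLIP space. The following sketch assumes per-point features and a CLIP text embedding are already available; the names and the threshold value are illustrative assumptions, not OpenScene's API.

```python
import torch
import torch.nn.functional as F

# Illustrative names: point_feats is an (N, D) tensor of per-point features
# co-embedded in CLIP space; text_embed is the (D,) CLIP embedding of a
# free-form query such as "a wooden chair". The threshold is a guess.

def query_points(point_feats: torch.Tensor,
                 text_embed: torch.Tensor,
                 threshold: float = 0.25) -> torch.Tensor:
    sims = F.cosine_similarity(point_feats, text_embed.unsqueeze(0), dim=-1)
    return sims > threshold                   # boolean mask of matching points
```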