Panoptic Vision-Language Feature Fields
- URL: http://arxiv.org/abs/2309.05448v2
- Date: Thu, 18 Jan 2024 08:33:35 GMT
- Title: Panoptic Vision-Language Feature Fields
- Authors: Haoran Chen, Kenneth Blomqvist, Francesco Milano and Roland Siegwart
- Abstract summary: We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to that of state-of-the-art closed-set 3D systems on the HyperSim, ScanNet and Replica datasets.
- Score: 27.209602602110916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, methods have been proposed for 3D open-vocabulary semantic
segmentation. Such methods are able to segment scenes into arbitrary classes
based on text descriptions provided during runtime. In this paper, we propose,
to the best of our knowledge, the first algorithm for open-vocabulary panoptic
segmentation in 3D scenes. Our algorithm, Panoptic Vision-Language Feature
Fields (PVLFF), learns a semantic feature field of the scene by distilling
vision-language features from a pretrained 2D model, and jointly fits an
instance feature field through contrastive learning using 2D instance segments
on input frames. Despite not being trained on the target classes, our method
achieves panoptic segmentation performance similar to the state-of-the-art
closed-set 3D systems on the HyperSim, ScanNet and Replica datasets, and
additionally outperforms current 3D open-vocabulary systems in terms of
semantic segmentation. We ablate the components of our method to demonstrate
the effectiveness of our model architecture. Our code will be available at
https://github.com/ethz-asl/pvlff.
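The abstract describes two jointly optimized objectives: distilling 2D vision-language features into a semantic feature field, and fitting an instance feature field with a contrastive loss over 2D instance segments. The PyTorch sketch below only illustrates how such a pair of losses could be combined; the function names, loss forms, and feature dimensions are assumptions for illustration and are not taken from the PVLFF implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(rendered_feat, teacher_feat):
    """Pull rendered per-pixel semantic features toward the 2D vision-language
    teacher features; a simple L2 distance is used here as a stand-in."""
    return F.mse_loss(rendered_feat, teacher_feat)

def contrastive_instance_loss(inst_feat, seg_ids, temperature=0.1):
    """Pixel-pair contrastive loss: features of pixels belonging to the same
    2D instance segment are pulled together, other pairs pushed apart.

    inst_feat: (N, D) rendered instance features for N sampled pixels
    seg_ids:   (N,)   2D instance-segment id of each pixel
    """
    feat = F.normalize(inst_feat, dim=-1)
    sim = feat @ feat.t() / temperature
    n = feat.shape[0]
    eye = torch.eye(n, dtype=torch.bool, device=feat.device)
    pos = (seg_ids[:, None] == seg_ids[None, :]) & ~eye   # same segment, not self
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float("-inf")), dim=-1, keepdim=True)
    per_pixel = -(log_prob.masked_fill(~pos, 0.0)).sum(-1) / pos.sum(-1).clamp(min=1)
    return per_pixel.mean()

# Toy batch of sampled pixels: 64 pixels, 512-dim vision-language features,
# 16-dim instance features, 4 fake instance segments.
sem = torch.randn(64, 512, requires_grad=True)
teacher = torch.randn(64, 512)
inst = torch.randn(64, 16, requires_grad=True)
segs = torch.randint(0, 4, (64,))

loss = distillation_loss(sem, teacher) + contrastive_instance_loss(inst, segs)
loss.backward()
```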
Related papers
- Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions.
We introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation.
Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting.
arXiv Detail & Related papers (2024-09-27T03:44:07Z)
- Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z)
- 3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance [68.8825501902835]
3DSS-VLG is a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance.
To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation by using the textual semantic information of text category labels.
arXiv Detail & Related papers (2024-07-13T09:39:11Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- Weakly Supervised 3D Open-vocabulary Segmentation [104.07740741126119]
We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner.
We distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF).
A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process.
arXiv Detail & Related papers (2023-05-23T14:16:49Z)
- Neural Implicit Vision-Language Feature Fields [40.248658511361015]
We present a zero-shot volumetric open-vocabulary semantic scene segmentation method.
Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation.
We show that our method works on noisy real-world data and can run in real time on live sensor data, dynamically adjusting to text prompts.
arXiv Detail & Related papers (2023-03-20T09:38:09Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
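The zero-shot querying step shared by OpenScene and the feature-field methods above amounts to a nearest-text-embedding lookup. Below is a minimal sketch assuming per-point features are already co-embedded with CLIP text embeddings; the random point features and the ViT-B/32 backbone are placeholders, not the exact models or pipelines used in these papers.

```python
import torch
import clip  # OpenAI CLIP; the papers use specific variants (e.g. LSeg/OpenSeg features)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical per-point features co-embedded in the CLIP text space,
# e.g. rendered from a feature field or predicted by a 3D backbone: (N_points, 512).
point_features = torch.randn(10_000, 512, device=device)

# Arbitrary text prompts supplied at query time (the "open vocabulary").
prompts = ["a chair", "a table", "a window", "the floor"]
with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device)).float()

# Cosine similarity between every point and every prompt, then argmax per point.
point_features = torch.nn.functional.normalize(point_features, dim=-1)
text_features = torch.nn.functional.normalize(text_features, dim=-1)
labels = (point_features @ text_features.t()).argmax(dim=-1)  # (N_points,) prompt index
```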