Neural Implicit Vision-Language Feature Fields
- URL: http://arxiv.org/abs/2303.10962v1
- Date: Mon, 20 Mar 2023 09:38:09 GMT
- Title: Neural Implicit Vision-Language Feature Fields
- Authors: Kenneth Blomqvist, Francesco Milano, Jen Jen Chung, Lionel Ott, Roland Siegwart
- Abstract summary: We present a zero-shot volumetric open-vocabulary semantic scene segmentation method.
Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation.
We show that our method works on noisy real-world data and can run in real time on live sensor data, dynamically adjusting to text prompts.
- Score: 40.248658511361015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, groundbreaking results have been presented on open-vocabulary
semantic image segmentation. Such methods segment each pixel in an image into
arbitrary categories provided at run-time in the form of text prompts, as
opposed to a fixed set of classes defined at training time. In this work, we
present a zero-shot volumetric open-vocabulary semantic scene segmentation
method. Our method builds on the insight that we can fuse image features from a
vision-language model into a neural implicit representation. We show that the
resulting feature field can be segmented into different classes by assigning
points to natural language text prompts. The implicit volumetric representation
enables us to segment the scene both in 3D and 2D by rendering feature maps
from any given viewpoint of the scene. We show that our method works on noisy
real-world data and can run in real time on live sensor data, dynamically
adjusting to text prompts. We also present quantitative comparisons on the
ScanNet dataset.
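
Below is a minimal sketch, in PyTorch, of the zero-shot assignment step the abstract describes: features queried from the implicit field are compared against encoded text prompts by cosine similarity, and each 3D point takes the class of the best-matching prompt. The names (`feature_field`, `text_embeddings`) and the stand-in MLP are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def segment_points(feature_field, points, text_embeddings):
    """Assign each 3D point to the text prompt whose embedding is most
    similar (cosine similarity) to the point's queried feature."""
    point_features = F.normalize(feature_field(points), dim=-1)  # (N, D)
    prompt_features = F.normalize(text_embeddings, dim=-1)       # (C, D)
    similarity = point_features @ prompt_features.T              # (N, C)
    return similarity.argmax(dim=-1)                             # (N,) class indices

# Illustrative usage with a stand-in MLP field and random prompt embeddings.
field = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 512))
points = torch.rand(1000, 3)            # 3D query points in the scene
text_embeddings = torch.randn(5, 512)   # e.g., 5 encoded text prompts
labels = segment_points(field, points, text_embeddings)  # (1000,)
```

The same comparison can be applied to feature maps rendered from any viewpoint, which is how the 2D segmentations mentioned in the abstract would be obtained.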
Related papers
- Self-supervised Learning of Neural Implicit Feature Fields for Camera Pose Refinement [32.335953514942474]
This paper proposes to jointly learn the scene representation along with a 3D dense feature field and a 2D feature extractor.
We learn the underlying geometry of the scene with an implicit field through volumetric rendering and design our feature field to leverage intermediate geometric information encoded in the implicit field.
Visual localization is then achieved by aligning the image-based features and the rendered volumetric features.
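
As a rough illustration of the alignment idea above, the sketch below refines a pose by gradient descent on a feature-space discrepancy; `render_features` and `image_features` are hypothetical placeholders standing in for the paper's rendered volumetric features and 2D extractor outputs.

```python
import torch

def refine_pose(render_features, image_features, init_pose, steps=100, lr=1e-2):
    """Refine a camera pose by minimizing the discrepancy between features
    rendered from the implicit field and features extracted from the image.

    render_features: differentiable callable, (6,) pose vector -> (H, W, D)
    image_features:  (H, W, D) features from the 2D feature extractor
    init_pose:       (6,) initial pose guess (translation + axis-angle)
    """
    pose = init_pose.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        rendered = render_features(pose)  # (H, W, D)
        loss = 1.0 - torch.nn.functional.cosine_similarity(
            rendered.flatten(0, 1), image_features.flatten(0, 1), dim=-1).mean()
        loss.backward()
        optimizer.step()
    return pose.detach()
```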
arXiv Detail & Related papers (2024-06-12T17:51:53Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model and empowered by open-vocabulary semantics to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Panoptic Vision-Language Feature Fields [27.209602602110916]
We propose the first algorithm for open-vocabulary panoptic segmentation in 3D scenes.
Our algorithm learns a semantic feature field of the scene by distilling vision-language features from a pretrained 2D model.
Our method achieves panoptic segmentation performance similar to the state-of-the-art closed-set 3D systems on the HyperSim, ScanNet, and Replica datasets.
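
A minimal sketch of such a distillation objective is shown below: the field is supervised so that its rendered features match the pretrained 2D vision-language features at sampled pixels. The function and tensor names are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def feature_distillation_loss(rendered_features, teacher_features):
    """Cosine distillation loss between features volume-rendered from the
    field at sampled pixels and the pretrained 2D model's features at the
    same pixels.

    rendered_features: (B, D) rendered features for a batch of rays/pixels
    teacher_features:  (B, D) target features from the 2D vision-language model
    """
    cosine = F.cosine_similarity(rendered_features, teacher_features, dim=-1)
    return (1.0 - cosine).mean()

# Example: loss = feature_distillation_loss(rendered, teacher_features_at_pixels)
```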
arXiv Detail & Related papers (2023-09-11T13:41:27Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: one pixel-based and one latent-based.
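
The sketch below illustrates one plausible way to assemble such a spatio-textual conditioning map: each segment of the user-provided map is filled with the embedding of its local text description. This is an assumption-laden simplification; SpaText's actual conditioning pipeline differs in its details.

```python
import torch

def build_spatio_textual_map(segmentation, prompt_embeddings, embed_dim):
    """Fill each region of a user-provided segmentation map with the
    embedding of that region's local text description.

    segmentation:      (H, W) integer tensor; 0 = unlabeled, 1..K = segments
    prompt_embeddings: dict mapping segment id -> (D,) embedding tensor
    embed_dim:         D, dimensionality of the embeddings
    """
    h, w = segmentation.shape
    spatial_map = torch.zeros(h, w, embed_dim)
    for segment_id, embedding in prompt_embeddings.items():
        spatial_map[segmentation == segment_id] = embedding
    return spatial_map  # (H, W, D), used alongside the global text prompt

# Illustrative usage: two segments with random stand-in embeddings.
seg = torch.zeros(64, 64, dtype=torch.long)
seg[10:30, 10:30] = 1
seg[40:60, 40:60] = 2
cond = build_spatio_textual_map(seg, {1: torch.randn(512), 2: torch.randn(512)}, 512)
```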
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP on its two evaluated benchmarks, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
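
As a simplified sketch of the fusion step described in the entry above, the snippet below concatenates a retrieved knowledge embedding with global image features before a classification head; the dimensions and architecture are illustrative assumptions, and the paper's actual KnowBert-based retrieval and fusion are more involved.

```python
import torch
import torch.nn as nn

class KnowledgeFusionClassifier(nn.Module):
    """Fuse an image feature vector with a retrieved knowledge embedding
    and predict a fine-grained class."""

    def __init__(self, image_dim, knowledge_dim, num_classes):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(image_dim + knowledge_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, image_feat, knowledge_emb):
        fused = torch.cat([image_feat, knowledge_emb], dim=-1)
        return self.head(fused)

# Illustrative usage with random stand-ins for the real encoders.
model = KnowledgeFusionClassifier(image_dim=2048, knowledge_dim=768, num_classes=28)
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
```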
This list is automatically generated from the titles and abstracts of the papers in this site.