OpenScene: 3D Scene Understanding with Open Vocabularies
- URL: http://arxiv.org/abs/2211.15654v2
- Date: Thu, 6 Apr 2023 15:35:13 GMT
- Title: OpenScene: 3D Scene Understanding with Open Vocabularies
- Authors: Songyou Peng, Kyle Genova, Chiyu "Max" Jiang, Andrea Tagliasacchi,
Marc Pollefeys, Thomas Funkhouser
- Abstract summary: Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
- Score: 73.1411930820683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional 3D scene understanding approaches rely on labeled 3D datasets to
train a model for a single task with supervision. We propose OpenScene, an
alternative approach where a model predicts dense features for 3D scene points
that are co-embedded with text and image pixels in CLIP feature space. This
zero-shot approach enables task-agnostic training and open-vocabulary queries.
For example, to perform SOTA zero-shot 3D semantic segmentation it first infers
CLIP features for every 3D point and later classifies them based on
similarities to embeddings of arbitrary class labels. More interestingly, it
enables a suite of open-vocabulary scene understanding applications that have
never been done before. For example, it allows a user to enter an arbitrary
text query and then see a heat map indicating which parts of a scene match. Our
approach is effective at identifying objects, materials, affordances,
activities, and room types in complex 3D scenes, all using a single model
trained without any labeled 3D data.
Related papers
- OpenSU3D: Open World 3D Scene Understanding using Foundation Models [2.1262749936758216]
We present a novel, scalable approach for constructing open set, instance-level 3D scene representations.
Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning.
We evaluate our proposed approach on multiple scenes from ScanNet and Replica datasets demonstrating zero-shot generalization capabilities.
arXiv Detail & Related papers (2024-07-19T13:01:12Z) - Open-Vocabulary SAM3D: Understand Any 3D Scene [32.00537984541871]
We introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding.
This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene.
Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
arXiv Detail & Related papers (2024-05-24T14:07:57Z) - Grounded 3D-LLM with Referent Tokens [58.890058568493096]
We propose Grounded 3D-LLM to consolidate various 3D vision tasks within a unified generative framework.
The model uses scene referent tokens as special noun phrases to reference 3D scenes.
It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates.
arXiv Detail & Related papers (2024-05-16T18:03:41Z) - Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships [15.513180297629546]
We present Open3DSG, an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data.
We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models.
arXiv Detail & Related papers (2024-02-19T16:15:03Z) - Lowis3D: Language-Driven Open-World Instance-Level 3D Scene
Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z) - OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z) - SGAligner : 3D Scene Alignment with Scene Graphs [84.01002998166145]
Building 3D scene graphs has emerged as a topic in scene representation for several embodied AI applications.
We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial.
We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios.
arXiv Detail & Related papers (2023-04-28T14:39:22Z) - CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D
Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z) - Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.