OpenSU3D: Open World 3D Scene Understanding using Foundation Models
- URL: http://arxiv.org/abs/2407.14279v1
- Date: Fri, 19 Jul 2024 13:01:12 GMT
- Title: OpenSU3D: Open World 3D Scene Understanding using Foundation Models
- Authors: Rafay Mohiuddin, Sai Manoj Prakhya, Fiona Collins, Ziyuan Liu, André Borrmann
- Abstract summary: We present a novel, scalable approach for constructing open set, instance-level 3D scene representations.
Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning.
We evaluate our proposed approach on multiple scenes from ScanNet and Replica datasets demonstrating zero-shot generalization capabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel, scalable approach for constructing open set, instance-level 3D scene representations, advancing open world understanding of 3D environments. Existing methods require pre-constructed 3D scenes and face scalability issues due to per-point feature vector learning, limiting their efficacy with complex queries. Our method overcomes these limitations by incrementally building instance-level 3D scene representations using 2D foundation models, efficiently aggregating instance-level details such as masks, feature vectors, names, and captions. We introduce fusion schemes for feature vectors to enhance their contextual knowledge and performance on complex queries. Additionally, we explore large language models for robust automatic annotation and spatial reasoning tasks. We evaluate our proposed approach on multiple scenes from ScanNet and Replica datasets demonstrating zero-shot generalization capabilities, exceeding current state-of-the-art methods in open world 3D scene understanding.
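The abstract describes aggregating per-instance feature vectors across views and fusing them to answer open-vocabulary queries. A minimal sketch of one plausible scheme, assuming CLIP-style embeddings: L2-normalize each view's feature, average them per instance, and rank instances against a text embedding by cosine similarity. The function names and the averaging scheme are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def fuse_instance_features(view_features):
    """Fuse per-view feature vectors for one instance into a single
    embedding. Hypothetical scheme: L2-normalize each view's vector
    so no single view dominates, then average and re-normalize."""
    feats = np.asarray(view_features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    fused = feats.mean(axis=0)
    return fused / np.linalg.norm(fused)

def query_scores(instance_features, text_feature):
    """Cosine similarity between fused instance embeddings and a text
    query embedding, for open-vocabulary retrieval."""
    t = np.asarray(text_feature, dtype=np.float64)
    t = t / np.linalg.norm(t)
    return {name: float(f @ t) for name, f in instance_features.items()}
```

Because features are fused incrementally per instance rather than learned per point, adding a new frame only updates the affected instances' running averages, which is one way the scalability claim could be realized.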
Related papers
- Open-Vocabulary SAM3D: Understand Any 3D Scene [32.00537984541871]
We introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding.
This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene.
Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
arXiv Detail & Related papers (2024-05-24T14:07:57Z)
- Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance [49.14140194332482]
We introduce Open3DIS, a novel solution designed to tackle the problem of open-vocabulary instance segmentation within 3D scenes.
Objects within 3D environments exhibit diverse shapes, scales, and colors, making precise instance-level identification a challenging task.
arXiv Detail & Related papers (2023-12-17T10:07:03Z)
- SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z)
- OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation [32.508069732371105]
OpenIns3D is a new 3D-input-only framework for 3D open-vocabulary scene understanding.
It achieves state-of-the-art performance across a wide range of 3D open-vocabulary tasks.
arXiv Detail & Related papers (2023-09-01T17:59:56Z)
- Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding [57.47315482494805]
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset.
This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories.
We propose to harness pre-trained vision-language (VL) foundation models that encode extensive knowledge from image-text pairs to generate captions for 3D scenes.
arXiv Detail & Related papers (2023-08-01T07:50:14Z)
- OpenMask3D: Open-Vocabulary 3D Instance Segmentation [84.58747201179654]
OpenMask3D is a zero-shot approach for open-vocabulary 3D instance segmentation.
Our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
arXiv Detail & Related papers (2023-06-23T17:36:44Z)
- OpenScene: 3D Scene Understanding with Open Vocabularies [73.1411930820683]
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision.
We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space.
This zero-shot approach enables task-agnostic training and open-vocabulary queries.
arXiv Detail & Related papers (2022-11-28T18:58:36Z)
- Prompt-guided Scene Generation for 3D Zero-Shot Learning [8.658191774247944]
We propose a prompt-guided 3D scene generation and supervision method that augments 3D data to better train the network.
First, we merge the point clouds of two 3D models in ways described by a prompt. The prompt acts as the annotation describing each 3D scene.
We achieve state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
arXiv Detail & Related papers (2022-09-29T11:24:33Z)
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.