Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
- URL: http://arxiv.org/abs/2507.12123v1
- Date: Wed, 16 Jul 2025 10:47:12 GMT
- Title: Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
- Authors: Sergey Linok, Gleb Naumov
- Abstract summary: OVIGo-3DHSG represents an extensive indoor environment as a Hierarchical Scene Graph. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose OVIGo-3DHSG, a method for Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment as a Hierarchical Scene Graph derived from sequences of RGB-D frames, using a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial references to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG shows strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
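The abstract describes a four-layer graph (floors, rooms, locations, objects) queried by an LLM one layer at a time. Below is a minimal Python sketch of that idea; the `Node` class, `nodes_at` traversal, and the `llm.choose(prompt, options)` helper are hypothetical illustrations under those assumptions, not the released OVIGo-3DHSG API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    layer: str                           # "floor" | "room" | "location" | "object"
    label: str                           # open-vocabulary label, e.g. "kitchen"
    children: list["Node"] = field(default_factory=list)   # inter-layer edges
    neighbors: list["Node"] = field(default_factory=list)  # intra-layer edges

def nodes_at(root: Node, layer: str) -> list[Node]:
    """Collect all descendants on a given layer by walking inter-layer edges."""
    found, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.layer == layer:
            found.append(n)
        stack.extend(n.children)
    return found

def ground(scene: Node, query: str, llm) -> Optional[Node]:
    """Multistep grounding: the LLM narrows the search one layer at a time.
    `llm.choose(prompt, options)` is an assumed helper returning one option.
    Queries that reference nearby objects would additionally consult the
    intra-layer `neighbors` edges of the candidate objects."""
    scope = scene
    for layer in ("floor", "room", "location", "object"):
        candidates = nodes_at(scope, layer)
        if not candidates:
            return None
        labels = [n.label for n in candidates]
        choice = llm.choose(f"Query: {query!r}. Pick the relevant {layer}.",
                            labels)
        scope = candidates[labels.index(choice)]
    return scope  # grounded object node
```

Given a scene root, `ground(scene, "the mug on the kitchen counter", llm)` would walk floor, then room, then location, then object; the repository linked above contains the actual implementation.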
Related papers
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- Intelligent Spatial Perception by Building Hierarchical 3D Scene Graphs for Indoor Scenarios with the Help of LLMs
This paper introduces a novel system that harnesses the capabilities of Large Language Models (LLMs) to construct hierarchical 3D Scene Graphs for indoor scenarios. The proposed framework constructs 3DSGs consisting of a fundamental layer with rich metric-semantic information, an object layer featuring precise point-cloud representation of object nodes, and higher layers of room, floor, and building nodes. Thanks to the innovative application of LLMs, not only object nodes but also nodes of higher layers, e.g., room nodes, are annotated in an intelligent and accurate manner.
arXiv Detail & Related papers (2025-03-19T10:40:28Z)
- TB-HSU: Hierarchical 3D Scene Understanding with Contextual Affordances [20.4157915852084]
We develop a model that learns to structure and vary functional affordance across a 3D hierarchical scene graph. The varying functional affordance is designed to integrate with the varying spatial context of the graph.
arXiv Detail & Related papers (2024-12-07T09:23:17Z)
- GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate "action possibilities" regions on 3D objects with arbitrary instructions. We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z)
- Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive octree structure is developed that stores semantics and represents the occupancy of an object at a resolution that adapts to its shape (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-11-25T10:14:10Z)
- Generating Visual Spatial Description via Holistic 3D Scene Understanding [88.99773815159345]
Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images.
With an external 3D scene extractor, we obtain the 3D objects and scene features for input images.
We construct a target object-centered 3D spatial scene graph (Go3D-S2G), such that we model the spatial semantics of target objects within the holistic 3D scenes.
arXiv Detail & Related papers (2023-05-19T15:53:56Z)
- Semantic and Geometric Modeling with Neural Message Passing in 3D Scene Graphs for Hierarchical Mechanical Search [48.655167907740136]
We use a 3D scene graph representation to capture the hierarchical, semantic, and geometric aspects of this problem.
We introduce Hierarchical Mechanical Search (HMS), a method that guides an agent's actions towards finding a target object specified with a natural language description.
HMS is evaluated on a novel dataset of 500 3D scene graphs with dense placements of semantically related objects in storage locations.
arXiv Detail & Related papers (2020-12-07T21:04:34Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)
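The adaptive octree from the Open-Vocabulary Octree-Graph entry above can be illustrated with a short sketch: cells containing surface points are subdivided until a resolution floor, so occupancy detail concentrates around the object's shape while a semantic label is carried by every cell. The `OctreeNode` class, `build` function, and the `min_half` parameter are hypothetical assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import Optional

Point = tuple[float, float, float]

@dataclass
class OctreeNode:
    center: Point
    half: float                          # half edge length of the cubic cell
    label: Optional[str] = None          # open-vocabulary semantic label
    occupied: bool = False               # leaf occupancy
    children: list["OctreeNode"] = field(default_factory=list)

def build(center: Point, half: float, points: list[Point],
          label: str, min_half: float = 0.05) -> OctreeNode:
    """Recursively subdivide cells that contain surface points, stopping at
    min_half (meters, assumed), so resolution adapts to the object's shape."""
    node = OctreeNode(center, half, label=label)
    inside = [p for p in points
              if all(abs(p[i] - center[i]) <= half for i in range(3))]
    if not inside:
        return node                      # empty leaf: free space stays coarse
    if half <= min_half:
        node.occupied = True             # finest resolution reached
        return node
    for dx in (-0.5, 0.5):               # eight octants, centers at +/- half/2
        for dy in (-0.5, 0.5):
            for dz in (-0.5, 0.5):
                child_center = (center[0] + dx * half,
                                center[1] + dy * half,
                                center[2] + dz * half)
                node.children.append(
                    build(child_center, half / 2, inside, label, min_half))
    return node
```

For example, `build((0.0, 0.0, 0.0), 1.0, chair_points, "chair")` would leave large empty cells away from the chair's surface and produce fine occupied leaves along it.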