Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs
- URL: http://arxiv.org/abs/2309.15940v1
- Date: Wed, 27 Sep 2023 18:32:29 GMT
- Title: Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs
- Authors: Haonan Chang, Kowndinya Boyalakuntla, Shiyang Lu, Siwei Cai, Eric
Jing, Shreesh Keskar, Shijie Geng, Adeeb Abbas, Lifeng Zhou, Kostas Bekris,
Abdeslam Boularias
- Abstract summary: Open-Vocabulary 3D Scene Graph (OVSG) is a formal framework for grounding entities with free-form text-based queries.
In contrast to existing research on 3D scene graphs, OVSG supports free-form text input and open-vocabulary querying.
- Score: 22.499136041727432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for
grounding a variety of entities, such as object instances, agents, and regions,
with free-form text-based queries. Unlike conventional semantic-based object
localization approaches, our system facilitates context-aware entity
localization, allowing for queries such as "pick up a cup on a kitchen table"
or "navigate to a sofa on which someone is sitting". In contrast to existing
research on 3D scene graphs, OVSG supports free-form text input and
open-vocabulary querying. Through a series of comparative experiments using the
ScanNet dataset and a self-collected dataset, we demonstrate that our proposed
approach significantly surpasses the performance of previous semantic-based
localization techniques. Moreover, we highlight the practical application of
OVSG in real-world robot navigation and manipulation experiments.
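The grounding mechanism lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: entities become graph nodes carrying open-vocabulary text embeddings, relations become edges, and a query such as "a cup on a kitchen table" is grounded by scoring each candidate on its own label match combined with how well its related neighbors match the query context. The embed() stand-in, the node fields, and the scoring weight alpha are all assumptions made for the demo.
```python
# Minimal, hypothetical sketch of OVSG-style context-aware grounding.
# NOT the authors' code: the graph fields, the embed() stand-in, and
# the scoring weights below are illustrative assumptions.
from dataclasses import dataclass, field
import math


def embed(text: str, dim: int = 64) -> list[float]:
    """Stand-in for an open-vocabulary text encoder (e.g. CLIP/SBERT).
    A real system would use learned embeddings; this hash trick only
    keeps the sketch self-contained."""
    v = [0.0] * dim
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def cos(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Node:
    label: str                       # free-form description of the entity
    vec: list[float] = field(init=False)

    def __post_init__(self):
        self.vec = embed(self.label)


# Tiny scene graph: nodes are entities, edges are (src, relation, dst).
nodes = [Node("cup"), Node("kitchen table"), Node("sofa"), Node("cup")]
edges = [(0, "on", 1), (3, "on", 2)]   # cup#0 on table, cup#3 on sofa


def ground(target: str, relation: str, context: str, alpha: float = 0.5):
    """Score each node on its own label match plus how well its related
    neighbors match the query context; return the best node index."""
    t, c = embed(target), embed(context)
    best, best_s = None, -1.0
    for i, node in enumerate(nodes):
        self_sim = cos(node.vec, t)
        ctx_sim = max((cos(nodes[dst].vec, c)
                       for src, rel, dst in edges
                       if src == i and rel == relation), default=0.0)
        s = alpha * self_sim + (1 - alpha) * ctx_sim
        if s > best_s:
            best, best_s = i, s
    return best, best_s


idx, score = ground("cup", "on", "kitchen table")
print(f"grounded query to node #{idx} ({nodes[idx].label}), score={score:.2f}")
```
Replacing the hash-based embed() with a real vision-language encoder and the hand-built graph with a perception pipeline is what separates this toy from a usable system.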
Related papers
- DiSCO-3D : Discovering and segmenting Sub-Concepts from Open-vocabulary queries in NeRF [0.5409700620900997]
DiSCO-3D aims to provide a 3D semantic segmentation that adapts to both the scene and user queries.
We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance.
Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery.
arXiv Detail & Related papers (2025-07-19T12:46:20Z) - A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries.
Existing language field methods struggle to accurately localize instances using spatial relations in language queries.
We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z) - OpenSplat3D: Open-Vocabulary 3D Instance Segmentation using Gaussian Splatting [52.40697058096931]
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for neural scene reconstruction.
We introduce an approach for open-vocabulary 3D instance segmentation without requiring manual labeling, termed OpenSplat3D.
We show results on LERF-mask and LERF-OVS as well as the full ScanNet++ validation set, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2025-06-09T12:37:15Z) - OpenFusion++: An Open-vocabulary Real-time Scene Understanding System [4.470499157873342]
We present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system.
Our approach refines 3D point clouds by fusing confidence maps from foundation models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework.
Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.
arXiv Detail & Related papers (2025-04-27T14:46:43Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding [9.289977174410824]
3D visual grounding involves localizing entities in a 3D scene referred to by natural language text.
We introduce Visual Grounding with Diverse Language in 3D (ViGiL3D), a diagnostic dataset for evaluating visual grounding methods against a diverse set of language patterns.
arXiv Detail & Related papers (2025-01-02T17:20:41Z) - Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive octree structure is developed that stores semantics and represents an object's occupancy at a resolution adapted to its shape.
arXiv Detail & Related papers (2024-11-25T10:14:10Z) - Multiview Scene Graph [7.460438046915524]
A proper scene representation is central to the pursuit of spatial intelligence.
We propose to build Multiview Scene Graphs (MSG) from unposed images.
MSG represents a scene topologically with interconnected place and object nodes.
arXiv Detail & Related papers (2024-10-15T02:04:05Z) - Search3D: Hierarchical Open-Vocabulary 3D Segmentation [78.47704793095669]
Open-vocabulary 3D segmentation enables the exploration of 3D spaces using free-form text descriptions.
We introduce Search3D, an approach that builds a hierarchical open-vocabulary 3D scene representation.
Our method aims to expand the capabilities of open-vocabulary instance-level 3D segmentation by shifting towards a more flexible open-vocabulary 3D search setting.
arXiv Detail & Related papers (2024-09-27T03:44:07Z) - Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph [0.3926357402982764]
We propose a modular approach called BBQ that constructs a 3D scene graph representation with metric and semantic edges.
BBQ employs robust DINO-powered associations to construct a 3D object-centric map.
We show that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods.
arXiv Detail & Related papers (2024-06-11T09:57:04Z) - SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph
Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our newly designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z) - OV-VG: A Benchmark for Open-Vocabulary Visual Grounding [33.02137080950678]
This work introduces novel and challenging open-vocabulary visual tasks.
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
arXiv Detail & Related papers (2023-10-22T17:54:53Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG).
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph
Diffusion [83.30168660888913]
We present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes.
Our pipeline consists of two branches, one predicting the overall scene layout via a variational auto-encoder and the other generating compatible shapes.
The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model.
arXiv Detail & Related papers (2023-05-25T17:39:13Z)
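Several entries above describe their underlying data structures only at a high level; the Octree-Graph entry, for instance, stores semantics in an octree whose resolution adapts to object shape. As a concrete illustration, here is a minimal sketch of such an adaptive semantic octree. It is an assumption-laden demo, not the paper's implementation: the voxel oracle, the node layout, and the subdivision rule (split only while a cell contains more than one label) were chosen for brevity.
```python
# Minimal sketch of an adaptive semantic octree in the spirit of the
# Octree-Graph entry above; the toy voxel oracle and node layout are
# illustrative assumptions, not the paper's implementation.
from dataclasses import dataclass
from typing import Optional

# Toy scene: occupied unit voxels -> semantic label.
VOXELS = {(0, 0, 0): "cup", (1, 0, 0): "cup", (0, 3, 0): "table",
          (6, 7, 0): "table", (7, 7, 0): "table", (6, 6, 0): "table"}


def labels_in(origin, size):
    """Labels of all occupied voxels inside the cube (the 'oracle')."""
    x0, y0, z0 = origin
    return {lab for (x, y, z), lab in VOXELS.items()
            if x0 <= x < x0 + size and y0 <= y < y0 + size
            and z0 <= z < z0 + size}


@dataclass
class Node:
    origin: tuple
    size: int
    label: Optional[str] = None      # set on homogeneous leaves
    children: Optional[list] = None  # 8 sub-cubes when mixed


def build(origin=(0, 0, 0), size=8):
    """Subdivide only while a cell holds more than one label, so the
    tree is fine near object boundaries and coarse elsewhere."""
    labs = labels_in(origin, size)
    node = Node(origin, size)
    if len(labs) <= 1 or size == 1:
        node.label = labs.pop() if labs else None
        return node
    h = size // 2
    node.children = [build((origin[0] + dx, origin[1] + dy, origin[2] + dz), h)
                     for dx in (0, h) for dy in (0, h) for dz in (0, h)]
    return node


def leaves(node):
    """Yield every occupied leaf as (origin, edge length, label)."""
    if node.children is None:
        if node.label:
            yield node.origin, node.size, node.label
    else:
        for c in node.children:
            yield from leaves(c)


tree = build()
for origin, size, label in leaves(tree):
    print(f"{label:5s} cell at {origin}, edge {size}")
```
Because homogeneous regions stay coarse, the same scene costs far fewer cells than a fixed-resolution voxel grid, which is the motivation for adaptive structures of this kind.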
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences arising from its use.