Related papers: Spatial Reasoner: A 3D Inference Pipeline for XR Applications

Spatial Reasoner: A 3D Inference Pipeline for XR Applications

URL: http://arxiv.org/abs/2504.18380v1
Date: Fri, 25 Apr 2025 14:27:27 GMT
Title: Spatial Reasoner: A 3D Inference Pipeline for XR Applications
Authors: Steven Häsler, Philipp Ackermann,
Abstract summary: We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks.<n>Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates.<n>The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern extended reality XR systems provide rich analysis of image data and fusion of sensor input and demand AR/VR applications that can reason about 3D scenes in a semantic manner. We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks such as determining how 3D objects are arranged among each other ('on', 'behind', 'near', etc.). Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates, ranging from topology and connectivity to directionality and orientation, expressed in a formalism related to natural language. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation. Implementations for client- and server-side processing demonstrate the framework's capability to efficiently translate geometric data into actionable knowledge, ensuring scalable and technology-independent spatial reasoning in complex 3D environments. The Spatial Reasoner framework is fostering the creation of spatial ontologies, and seamlessly integrates with and therefore enriches machine learning, natural language processing, and rule systems in XR applications.

Related papers

SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Stextscurprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes.<n>Stextscurprise3D consists of more than 200k vision language pairs across 900+ detailed indoor scenes from ScanNet++ v2.<n>The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object name.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries.<n>Existing language field methods struggle to accurately localize instances using spatial relations in language queries.<n>We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z)
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.<n>This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments.<n>We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context.<n>Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models [9.568997654206823]
SORT3D is an approach that utilizes rich object attributes from 2D data and merges as-based spatial reasoning toolbox with the ability of large language models. We show that SORT3D achieves state-of-the-art performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on an autonomous vehicle and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments.
arXiv Detail & Related papers (2025-04-25T20:24:11Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction. We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images. We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assesses the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
GaussianGraph: 3D Gaussian-based Scene Graph Generation for Open-world Scene Understanding [20.578106363482018]
We propose a novel framework that enhances 3DGS-based scene understanding by integrating semantic clustering and scene graph generation. We introduce a "Control-Follow" clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression. We enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models.
arXiv Detail & Related papers (2025-03-06T02:36:59Z)
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding [53.42728468191711]
Open-Vocabulary 3D object affordance grounding aims to anticipate action possibilities'' regions on 3D objects with arbitrary instructions.<n>We propose GREAT (GeometRy-intEntion collAboraTive inference) for Open-Vocabulary 3D Object Affordance Grounding.
arXiv Detail & Related papers (2024-11-29T11:23:15Z)
Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset. We provide an assessment system that grades natural language responses based on predefined ground-truth answers. Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models [67.55362046790512]
Vision-language models lack the ability to correctly reason over spatial relationships. We develop the REVISION framework which improves spatial fidelity in vision-language models. Our results and findings indicate that utilizing rendering-based frameworks is an effective approach for developing spatially-aware models.
arXiv Detail & Related papers (2024-08-05T04:51:46Z)
Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns. A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
GOI: Find 3D Gaussians of Interest with an Optimizable Open-vocabulary Semantic-space Hyperplane [53.388937705785025]
3D open-vocabulary scene understanding is crucial for advancing augmented reality and robotic applications. We introduce GOI, a framework that integrates semantic features from 2D vision-language foundation models into 3D Gaussian Splatting (3DGS) Our method treats the feature selection process as a hyperplane division within the feature space, retaining only features that are highly relevant to the query.
arXiv Detail & Related papers (2024-05-27T18:57:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.