A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
- URL: http://arxiv.org/abs/2510.27033v1
- Date: Thu, 30 Oct 2025 22:40:23 GMT
- Title: A Multi-Modal Neuro-Symbolic Approach for Spatial Reasoning-Based Visual Grounding in Robotics
- Authors: Simindokht Jahangard, Mehrzad Mohammadi, Abhinav Dhall, Hamid Rezatofighi,
- Abstract summary: We propose a novel neuro_symbolic framework that integrates panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships.<n>Our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
- Score: 20.82362652411105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual reasoning, particularly spatial reasoning, is a challenging cognitive task that requires understanding object relationships and their interactions within complex environments, especially in robotics domain. Existing vision_language models (VLMs) excel at perception tasks but struggle with fine-grained spatial reasoning due to their implicit, correlation-driven reasoning and reliance solely on images. We propose a novel neuro_symbolic framework that integrates both panoramic-image and 3D point cloud information, combining neural perception with symbolic reasoning to explicitly model spatial and logical relationships. Our framework consists of a perception module for detecting entities and extracting attributes, and a reasoning module that constructs a structured scene graph to support precise, interpretable queries. Evaluated on the JRDB-Reasoning dataset, our approach demonstrates superior performance and reliability in crowded, human_built environments while maintaining a lightweight design suitable for robotics and embodied AI applications.
Related papers
- MosaicThinker: On-Device Visual Spatial Reasoning for Embodied AI via Iterative Construction of Space Representation [11.01583588981339]
We present a new inference-time computing technique for on-device embodied AI, namely emphMosaicThinker.<n>Our basic idea is to integrate fragmented spatial information from multiple frames into a unified space representation of global semantic map, and further guide the VLM's spatial reasoning over the semantic map via a visual prompt.<n>Experiment results show that our technique can greatly enhance the accuracy of cross-frame spatial reasoning on resource-constrained embodied AI devices, over reasoning tasks with diverse types and complexities.
arXiv Detail & Related papers (2026-02-06T06:17:29Z) - Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
spatial reasoning advances vision-language models from visual perception toward semantic understanding.<n>We integrate the cognitive concept of an object-centric blueprint into spatial reasoning.<n>Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs)<n>We categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective [53.556348738917166]
Recent advances in agentic AI have led to systems capable of autonomous task execution and language-based reasoning.<n>Human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context-aware decision-making in unstructured environments.
arXiv Detail & Related papers (2025-09-11T05:23:22Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [89.77871049500546]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.<n>This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - Assured Autonomy with Neuro-Symbolic Perception [11.246557832016238]
Many state-of-the-art AI models deployed in cyber-physical systems (CPS) are pattern-matchers.<n>With limited security guarantees, there are concerns for their reliability in safety-critical and contested domains.<n>We propose a paradigm shift that imbues data-driven perception models with symbolic structure.
arXiv Detail & Related papers (2025-05-27T15:21:06Z) - MIRAGE: A Multi-modal Benchmark for Spatial Perception, Reasoning, and Intelligence [14.694404760882986]
MIRAGE is a benchmark designed to evaluate models' capabilities in Counting (object attribute recognition), Relation (spatial relational reasoning), and Counting with Relation.<n>By targeting these foundational abilities, MIRAGE provides a pathway toward spatial recognition towardtemporal reasoning in future research.
arXiv Detail & Related papers (2025-05-15T16:08:14Z) - SITE: towards Spatial Intelligence Thorough Evaluation [121.1493852562597]
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships.<n>We introduce SITE, a benchmark dataset towards SI Thorough Evaluation.<n>Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science.
arXiv Detail & Related papers (2025-05-08T17:45:44Z) - SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications.<n>We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset.<n> Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z) - LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and
Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training.
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.