Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning
- URL: http://arxiv.org/abs/2510.11996v1
- Date: Mon, 13 Oct 2025 22:51:20 GMT
- Title: Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning
- Authors: Tanner Muturi, Blessing Agyei Kyem, Joshua Kofi Asamoah, Neema Jakisa Owor, Richard Dyzinela, Andrews Danyo, Yaw Adu-Gyamfi, Armstrong Aboah
- Abstract summary: We introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in Track 3 of the 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions, in the form of bounding box coordinates, directly into the input prompts. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard.
- Score: 7.670666668651702
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle to generalize in such settings because they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in Track 3 of the 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions, in the form of bounding box coordinates, directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories, namely Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference, using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT responses in the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.
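The abstract's central technique, embedding bounding box coordinates into the prompt and appending a normalized answer to the training target, can be illustrated with a minimal sketch. The field names, coordinate format, and normalization rule below are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of structured prompt enrichment, assuming XYXY pixel
# bounding boxes and a simple numeric normalization rule; the exact
# format used by the paper is not specified here.

def enrich_prompt(question: str, regions: dict[str, tuple[int, int, int, int]]) -> str:
    """Embed per-object bounding box coordinates directly into the prompt."""
    lines = [f"- {name}: bbox=[x1={x1}, y1={y1}, x2={x2}, y2={y2}]"
             for name, (x1, y1, x2, y2) in regions.items()]
    return question + "\nObject regions:\n" + "\n".join(lines)

def append_normalized_answer(response: str, value: float) -> str:
    """Append a normalized answer so training targets match the evaluator."""
    return f"{response}\nNormalized answer: {value:.2f}"

# Usage: a distance-estimation training example (hypothetical values).
prompt = enrich_prompt(
    "How far is the pallet from the forklift?",
    {"pallet": (120, 340, 260, 480), "forklift": (610, 300, 880, 620)},
)
target = append_normalized_answer("The pallet is about 3.4 meters away.", 3.40)
print(prompt)
print(target)
```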
Related papers
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z)
- Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training, compelling the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z) - 3dSAGER: Geospatial Entity Resolution over 3D Objects (Technical Report) [7.378893412842889]
3dSAGER is an end-to-end pipeline for geospatial entity resolution over 3D objects. We present a novel, spatial-reference-independent featurization mechanism that captures intricate geometric characteristics of matching pairs. We also propose a new lightweight and interpretable blocking method, BKAFI, that leverages a trained model to efficiently generate high-recall candidate sets.
arXiv Detail & Related papers (2025-11-09T09:35:45Z)
- Reasoning in Space via Grounding in the World [28.913518130948244]
We introduce the Grounded-Spatial Reasoner (GS-Reasoner) to explore effective spatial representations that bridge the gap between grounding and reasoning. GS-Reasoner achieves impressive results on 3D visual grounding, which in turn significantly enhances its spatial reasoning capabilities.
arXiv Detail & Related papers (2025-10-15T17:58:08Z)
- TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints [1.7542461418660966]
We present TinyGiantVLM, a lightweight and modular framework designed for physical spatial reasoning. Our approach encodes both global and region-level features from RGB and depth modalities using pretrained visual backbones. To effectively handle the complexity of high-modality inputs and diverse question types, we incorporate a Mixture-of-Experts (MoE) fusion module (a minimal sketch of this fusion pattern appears after this list).
arXiv Detail & Related papers (2025-08-25T01:36:22Z)
- SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce SURPRISE3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. SURPRISE3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z)
- Enhancing Spatial Reasoning in Multimodal Large Language Models through Reasoning-based Segmentation [50.81551581148339]
We introduce Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. We also introduce 3D ReasonSeg, a reasoning-based segmentation dataset. Experiments demonstrate that R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities.
arXiv Detail & Related papers (2025-06-29T06:58:08Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- Spatial Reasoner: A 3D Inference Pipeline for XR Applications [0.0]
We present a spatial reasoning framework that bridges geometric facts with symbolic predicates and relations to handle key tasks. Its foundation relies on oriented 3D bounding box representations, enhanced by a comprehensive set of spatial predicates. The derived predicates form a spatial knowledge graph and, in combination with a pipeline-based inference model, enable spatial queries and dynamic rule evaluation (see the predicate sketch after this list).
arXiv Detail & Related papers (2025-04-25T14:27:27Z)
- Spatial-RAG: Spatial Retrieval Augmented Generation for Real-World Geospatial Reasoning Questions [5.053463027769152]
Spatial-RAG is a Retrieval-Augmented Generation framework designed for geospatial question answering. It integrates structured spatial databases with large language models (LLMs) via a hybrid spatial retriever. It formulates the answering process as a multi-objective optimization over spatial and semantic relevance (a minimal hybrid-scoring sketch appears after this list).
arXiv Detail & Related papers (2025-02-04T01:30:06Z)
- SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models [15.50826328938879]
We introduce SURDS, a benchmark designed to evaluate the spatial reasoning capabilities of vision-language models (VLMs). Built on the nuScenes dataset, SURDS comprises 41,080 vision-question-answer training instances and 9,250 evaluation samples. We propose a reinforcement learning-based alignment scheme leveraging spatially grounded reward signals.
arXiv Detail & Related papers (2024-11-20T08:14:01Z)
- Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
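Several entries above describe concrete mechanisms that are easy to sketch. First, the MoE fusion module from the TinyGiantVLM summary can be illustrated as a learned gate that softly weights expert projections of concatenated RGB and depth features. This is a generic PyTorch illustration; the module names, dimensions, expert count, and gating details are assumptions, not the paper's actual architecture.

```python
# Generic Mixture-of-Experts fusion sketch in PyTorch (illustrative only;
# TinyGiantVLM's real dimensions, expert count, and gating are not shown here).
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, rgb_dim: int, depth_dim: int, out_dim: int, num_experts: int = 4):
        super().__init__()
        in_dim = rgb_dim + depth_dim
        # Each expert is a small projection of the concatenated features.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU()) for _ in range(num_experts)]
        )
        # The gate produces a soft weight per expert from the same input.
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, depth_feat], dim=-1)           # (B, rgb_dim + depth_dim)
        weights = torch.softmax(self.gate(x), dim=-1)           # (B, num_experts)
        outputs = torch.stack([e(x) for e in self.experts], 1)  # (B, num_experts, out_dim)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)     # (B, out_dim)

# Usage with dummy global RGB-D features (hypothetical sizes).
fusion = MoEFusion(rgb_dim=768, depth_dim=256, out_dim=512)
fused = fusion(torch.randn(2, 768), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 512])
```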
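Second, the Spatial Reasoner entry derives symbolic predicates from oriented 3D bounding boxes. A minimal axis-aligned approximation of two such predicates might look like the following; the predicate names, thresholds, and box representation are illustrative assumptions, and real oriented-box reasoning would additionally need rotation handling.

```python
# Minimal sketch of spatial predicates over 3D boxes. Axis-aligned
# centers/extents are used purely for illustration; names and thresholds
# are assumptions, not the paper's predicate set.
from dataclasses import dataclass

@dataclass
class Box3D:
    cx: float; cy: float; cz: float   # center (z is up)
    sx: float; sy: float; sz: float   # full extents along each axis

def is_above(a: Box3D, b: Box3D) -> bool:
    """a's bottom face sits at or above b's top face."""
    return (a.cz - a.sz / 2) >= (b.cz + b.sz / 2)

def is_near(a: Box3D, b: Box3D, threshold: float = 1.0) -> bool:
    """Center distance in the ground plane falls below a threshold (meters)."""
    return ((a.cx - b.cx) ** 2 + (a.cy - b.cy) ** 2) ** 0.5 < threshold

# Derived predicates like these can populate edges of a spatial knowledge
# graph, e.g. ("shelf", "above", "pallet").
pallet = Box3D(0.0, 0.0, 0.5, 1.2, 0.8, 1.0)
shelf  = Box3D(0.1, 0.2, 2.0, 1.5, 0.6, 0.2)
print(is_above(shelf, pallet), is_near(shelf, pallet))  # True True
```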
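Finally, the Spatial-RAG entry frames retrieval as joint spatial and semantic relevance. One simple way to realize such a hybrid score is a convex combination of the two relevance terms; the weighting scheme and the distance-to-relevance mapping below are assumptions for illustration, standing in for the paper's multi-objective formulation.

```python
# Hypothetical hybrid retrieval score combining semantic similarity with a
# spatial proximity term; the summary above does not specify the real scheme.
import math

def spatial_relevance(dist_meters: float, scale: float = 500.0) -> float:
    """Map distance to (0, 1]; closer candidates score higher."""
    return math.exp(-dist_meters / scale)

def hybrid_score(semantic_sim: float, dist_meters: float, alpha: float = 0.6) -> float:
    """Convex combination of semantic and spatial relevance (alpha in [0, 1])."""
    return alpha * semantic_sim + (1 - alpha) * spatial_relevance(dist_meters)

# Rank candidate documents for a geospatial query (dummy values).
candidates = [
    {"id": "park_a", "semantic_sim": 0.82, "dist_meters": 1200.0},
    {"id": "park_b", "semantic_sim": 0.74, "dist_meters": 150.0},
]
ranked = sorted(candidates,
                key=lambda c: hybrid_score(c["semantic_sim"], c["dist_meters"]),
                reverse=True)
print([c["id"] for c in ranked])  # ['park_b', 'park_a']
```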