From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
- URL: http://arxiv.org/abs/2512.19683v2
- Date: Mon, 29 Dec 2025 03:39:22 GMT
- Title: From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
- Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang,
- Abstract summary: We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
- Score: 65.04549036809557
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
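The automatic question-generation step described in the abstract is easiest to picture with a small example. The following is a minimal sketch, not the authors' released pipeline: it assumes hypothetical object records carrying metric 3D positions (the kind of ground truth the stereo/LiDAR setup would provide) and turns them into one quantitative metric question and one qualitative relational question. All names here (`ObjectRecord`, `make_metric_question`, the coordinate convention) are invented for illustration.

```python
# Minimal sketch (an assumption, not the paper's actual pipeline) of how
# metrically precise 3D ground truth could be turned into spatial QA pairs.
from dataclasses import dataclass
import math

@dataclass
class ObjectRecord:  # hypothetical: one annotated object in a frame
    name: str
    position: tuple[float, float, float]  # metric x, y, z in the ego frame (meters)

def euclidean_distance(a: ObjectRecord, b: ObjectRecord) -> float:
    # math.dist computes the Euclidean distance between two coordinate tuples.
    return math.dist(a.position, b.position)

def make_metric_question(a: ObjectRecord, b: ObjectRecord) -> dict:
    """Quantitative question with a verifiable metric answer."""
    return {
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": round(euclidean_distance(a, b), 1),
        "type": "metric",
    }

def make_relational_question(a: ObjectRecord, b: ObjectRecord) -> dict:
    """Qualitative question derived from the same geometry
    (assuming x increases toward the ego-camera's right)."""
    side = "left" if a.position[0] < b.position[0] else "right"
    return {
        "question": f"Is the {a.name} to the left or right of the {b.name}?",
        "answer": side,
        "type": "relational",
    }

if __name__ == "__main__":
    car = ObjectRecord("parked car", (-2.5, 0.0, 8.0))
    bin_ = ObjectRecord("trash bin", (1.0, 0.0, 6.5))
    print(make_metric_question(car, bin_))
    print(make_relational_question(car, bin_))
```

A blinding test in the spirit the abstract describes would then feed each generated question to the model both with and without the corresponding frame; if accuracy barely drops without the image, the model is answering from linguistic priors rather than grounded visual reasoning.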
Related papers
- Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions [18.455501447828343]
Spatial Intelligence (SI) research has predominantly relied on Vision-Language Models (VLMs). We introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. We find that explicit spatial reasoning significantly boosts performance, suggesting that LLMs possess latent world-modeling potential.
arXiv Detail & Related papers (2026-01-07T05:13:52Z) - Scaling Spatial Reasoning in MLLMs through Programmatic Data Synthesis [8.60591720958037]
Vision-Language Models (VLMs) are scalable but structurally rigid, while manual annotation is linguistically diverse but unscalable. We introduce SP-RITE, a novel framework that overcomes this dilemma by leveraging simulators and large models. We have curated a dataset encompassing 3 simulators, 11k+ scenes, and 300k+ image/video instruction-tuning pairs. We demonstrate that a VLM trained on our data achieves significant performance gains on multiple spatial benchmarks.
arXiv Detail & Related papers (2025-12-18T06:30:08Z) - Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
We present a systematic analysis of spatial understanding from both data and architectural perspectives. From the data perspective, the performance of spatial understanding converges quickly as the training data increases. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model.
arXiv Detail & Related papers (2025-09-02T14:22:43Z) - MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models [0.0679877553227375]
This paper introduces MazeEval, a benchmark designed to isolate and evaluate pure spatial reasoning in Large Language Models. We evaluate eight state-of-the-art LLMs across identical mazes in both English and Icelandic to assess cross-linguistic transfer of spatial abilities.
arXiv Detail & Related papers (2025-07-27T19:33:45Z) - OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding [50.72259772580637]
We introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly degrade model performance along two separate axes.
arXiv Detail & Related papers (2025-07-10T17:56:07Z) - FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [78.65988445433844]
FloorplanQA is a diagnostic benchmark for evaluating spatial reasoning in large language models. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces.
arXiv Detail & Related papers (2025-07-10T11:16:48Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space [38.482463743451625]
We present Open3D-VQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The benchmark comprises 73k QA pairs spanning 7 general spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats.
arXiv Detail & Related papers (2025-03-14T05:35:38Z) - MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan. The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)