Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks
- URL: http://arxiv.org/abs/2509.02175v2
- Date: Thu, 04 Sep 2025 16:38:44 GMT
- Title: Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks
- Authors: Nils Hoehing, Mayug Maniparambil, Ellen Rushe, Noel E. O'Connor, Anthony Ventresque
- Abstract summary: We propose an open-source contrastive VLM benchmark that tests for spatial relation understanding. RocketScience comprises entirely new real-world image-text pairs. Results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs.
- Score: 9.23437069873238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It comprises entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience
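The contrastive protocol the abstract describes can be illustrated with a minimal, hypothetical sketch: each item pairs an image with a correct caption and a hard negative, and a model passes an item only if it scores the correct caption higher. The `score_caption` function below is a toy word-overlap stand-in, not the RocketScience evaluation code; a real harness would query a VLM here.

```python
# Minimal sketch of contrastive image-text benchmark scoring (hypothetical,
# not the RocketScience implementation).

def score_caption(image: str, caption: str) -> float:
    """Toy image-text score: word overlap between an image tag and a caption.
    A real benchmark would replace this with a VLM's matching score."""
    return len(set(image.split("_")) & set(caption.lower().split()))

def contrastive_accuracy(items) -> float:
    """Fraction of items where the correct caption outscores the negative."""
    correct = 0
    for image, positive, negative in items:
        if score_caption(image, positive) > score_caption(image, negative):
            correct += 1
    return correct / len(items)

items = [
    ("cup_left_of_plate", "The cup is left of the plate",
     "The cup is right of the plate"),
    ("dog_behind_fence", "A dog behind a fence",
     "A fence behind a dog"),
]
print(contrastive_accuracy(items))
```

Note that the bag-of-words scorer fails the second item, where the two captions differ only in word order; this is exactly the kind of order-sensitive spatial relation that contrastive benchmarks of this type are built to probe.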
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models [60.088066516175026]
We introduce a benchmark designed to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs). We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. We propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs.
arXiv Detail & Related papers (2026-02-24T13:38:37Z)
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
- Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding [8.202861909913791]
We present a benchmark for object-centric spatial reasoning in foundation models. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially-aware foundation models.
arXiv Detail & Related papers (2025-09-26T06:06:19Z)
- LRR-Bench: Left, Right or Rotate? Vision-Language Models Still Struggle With Spatial Understanding Tasks [22.011855291417856]
It remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement.
arXiv Detail & Related papers (2025-07-27T08:31:24Z)
- PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [75.04864582433879]
PlanQA is a diagnostic benchmark for evaluating geometric and spatial reasoning in large-language models. The benchmark uncovers diverse question types that test not only metric and topological reasoning but also interior design constraints.
arXiv Detail & Related papers (2025-07-10T11:16:48Z)
- Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
- Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning [36.588008658084895]
Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning. Our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems. We enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities.
arXiv Detail & Related papers (2024-10-21T16:26:09Z) - Does Spatial Cognition Emerge in Frontier Models? [56.47912101304053]
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
arXiv Detail & Related papers (2024-10-09T01:41:49Z) - GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning
Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z) - What's "up" with vision-language models? Investigating their struggle
with spatial reasoning [76.2406963762722]
Three new corpora quantify model comprehension of basic spatial relations.
We evaluate 18 vision-language (VL) models, finding that all perform poorly.
We conclude by studying causes of this surprising behavior.
arXiv Detail & Related papers (2023-10-30T17:50:15Z) - Visual Spatial Reasoning [35.5155400193075]
We present a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English.
We show how the dataset includes challenging linguistic phenomena, such as varying reference frames.
We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%.
arXiv Detail & Related papers (2022-04-30T23:03:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.