PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
- URL: http://arxiv.org/abs/2507.07644v1
- Date: Thu, 10 Jul 2025 11:16:48 GMT
- Title: PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
- Authors: Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka
- Abstract summary: PlanQA is a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models. The benchmark includes diverse question types that test not only metric and topological reasoning but also interior design constraints.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce PlanQA, a diagnostic benchmark for evaluating geometric and spatial reasoning in large language models (LLMs). PlanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, and bedrooms, encoded in a symbolic format (e.g., JSON, XML layouts). The benchmark includes diverse question types that test not only metric and topological reasoning (e.g., distance, visibility, shortest paths) but also interior design constraints such as affordance, clearance, balance, and usability. Our results across a variety of frontier open-source and commercial LLMs show that while models may succeed in shallow queries, they often fail to simulate physical constraints, preserve spatial coherence, or generalize under layout perturbation. PlanQA uncovers a clear blind spot in today's LLMs: they do not consistently reason about real-world layouts. We hope that this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
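To make the task setup concrete, the snippet below sketches a PlanQA-style symbolic layout and a shallow metric query over it. The JSON schema, the object names, and the `object_distance` helper are hypothetical illustrations of the format the abstract describes (structured indoor scenes queried with metric questions), not the benchmark's actual files.

```python
import json
import math

# Hypothetical JSON layout in the spirit of PlanQA's symbolic scene format.
layout_json = """
{
  "room": {"type": "kitchen", "width": 4.0, "depth": 3.0},
  "objects": [
    {"id": "sink",   "position": [0.6, 0.5], "size": [0.8, 0.6]},
    {"id": "stove",  "position": [2.4, 0.5], "size": [0.6, 0.6]},
    {"id": "fridge", "position": [3.4, 2.4], "size": [0.7, 0.7]}
  ]
}
"""

def object_distance(layout: dict, id_a: str, id_b: str) -> float:
    """Euclidean distance between two object centers (a shallow metric query)."""
    pos = {obj["id"]: obj["position"] for obj in layout["objects"]}
    (xa, ya), (xb, yb) = pos[id_a], pos[id_b]
    return math.hypot(xb - xa, yb - ya)

layout = json.loads(layout_json)
# A PlanQA-style question: "How far apart are the stove and the fridge?"
print(f"stove-fridge distance: {object_distance(layout, 'stove', 'fridge'):.2f} m")
```

The deeper query types the abstract lists (visibility, shortest paths, clearance) would require geometric reasoning over the same symbolic layout rather than a single distance computation, which is where the paper reports models breaking down.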
Related papers
- Linear Spatial World Models Emerge in Large Language Models
We investigate whether large language models implicitly encode linear spatial world models. We introduce a formal framework for spatial world models and assess whether such structure emerges in contextual embeddings. Our results provide empirical evidence that LLMs encode linear spatial world models.
arXiv Detail & Related papers (2025-06-03T15:31:00Z) - Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?
We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
arXiv Detail & Related papers (2025-05-17T08:48:40Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of vision-language models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - Does Spatial Cognition Emerge in Frontier Models?
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
arXiv Detail & Related papers (2024-10-09T01:41:49Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs). It features a controllable and diverse set of embodied tasks varying in difficulty and complexity. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning
Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks. We propose a new benchmark, termed $\textbf{P}$ath $\textbf{P}$lanning from $\textbf{N}$atural $\textbf{L}$anguage.
arXiv Detail & Related papers (2023-10-05T01:42:16Z) - Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations (see the SDF sketch after this list).
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
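For the signed-distance-field entry above, here is a minimal, self-contained sketch of what an SDF is. The circular primitive and the query points are illustrative assumptions, not the learned functionals from that paper.

```python
import math

def sdf_circle(px: float, py: float, cx: float, cy: float, r: float) -> float:
    """Signed distance from point (px, py) to a circle of radius r
    centered at (cx, cy): negative inside the object, zero on the
    boundary, positive outside."""
    return math.hypot(px - cx, py - cy) - r

# Query a few points against a unit circle at the origin.
for point in [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]:
    print(point, "->", round(sdf_circle(*point, 0.0, 0.0, 1.0), 3))
```

The sign convention (negative inside, positive outside) is what makes clearance and collision objectives easy to express in optimization-based planning.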