FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
- URL: http://arxiv.org/abs/2507.07644v2
- Date: Mon, 06 Oct 2025 12:00:21 GMT
- Title: FloorplanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
- Authors: Fedor Rodionov, Abdelrahman Eldesokey, Michael Birsak, John Femiani, Bernard Ghanem, Peter Wonka
- Abstract summary: FloorplanQA is a diagnostic benchmark for evaluating spatial reasoning in large language models. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces.
- Score: 78.65988445433844
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce FloorplanQA, a diagnostic benchmark for evaluating spatial reasoning in large language models (LLMs). FloorplanQA is grounded in structured representations of indoor scenes, such as kitchens, living rooms, bedrooms, and bathrooms, encoded symbolically in JSON or XML layouts. The benchmark covers core spatial tasks, including distance measurement, visibility, path finding, and object placement within constrained spaces. Our results across a variety of frontier open-source and commercial LLMs reveal that while models may succeed on shallow queries, they often fail to respect physical constraints or preserve spatial coherence, though they remain mostly robust to small spatial perturbations. FloorplanQA uncovers a blind spot in today's LLMs: inconsistent reasoning about indoor layouts. We hope this benchmark inspires new work on language models that can accurately infer and manipulate spatial and geometric properties in practical settings.
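To make the task format concrete, here is a minimal sketch of the kind of symbolic layout and distance query the abstract describes. The JSON schema, field names, and helper function below are illustrative assumptions, not the benchmark's actual format:

```python
import json
import math

# A hypothetical symbolic layout in the spirit of FloorplanQA; the schema
# (field names, units) is an illustrative assumption, not the paper's format.
layout_json = """
{
  "room": {"type": "kitchen", "width": 4.0, "depth": 3.0},
  "objects": [
    {"id": "sink",   "x": 0.5, "y": 2.5},
    {"id": "stove",  "x": 3.5, "y": 2.5},
    {"id": "fridge", "x": 3.5, "y": 0.5}
  ]
}
"""

def object_distance(layout: dict, a: str, b: str) -> float:
    """Euclidean distance between two named objects in the layout."""
    pos = {obj["id"]: (obj["x"], obj["y"]) for obj in layout["objects"]}
    return math.dist(pos[a], pos[b])

layout = json.loads(layout_json)
# A distance-measurement query of the kind the benchmark poses:
print(object_distance(layout, "sink", "stove"))  # 3.0
```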
Related papers
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z) - SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion [23.86761713752287]
Multimodal large language models (MLLMs) have achieved significant progress in image and language tasks. However, most MLLMs suffer from a limited ability to interpret and infer spatial arrangements in three-dimensional space. We propose a novel vision encoder based on hierarchical fusion of geometry and semantics features, generating spatial-aware visual embeddings.
arXiv Detail & Related papers (2025-11-21T15:24:33Z) - Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding [8.202861909913791]
We present a benchmark for object-centric spatial reasoning in foundation models. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially aware foundation models.
arXiv Detail & Related papers (2025-09-26T06:06:19Z) - Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
We present a systematic analysis of spatial understanding from both data and architectural perspectives. From the data perspective, the performance of spatial understanding converges quickly as the training data increases. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than within the language model.
arXiv Detail & Related papers (2025-09-02T14:22:43Z) - A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding [78.99798110890157]
Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries. Existing language field methods struggle to accurately localize instances using spatial relations in language queries. We propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning.
arXiv Detail & Related papers (2025-07-09T10:20:38Z) - Linear Spatial World Models Emerge in Large Language Models [4.9185678564997355]
We investigate whether large language models implicitly encode linear spatial world models. We introduce a formal framework for spatial world models and assess whether such structure emerges in contextual embeddings. Our results provide empirical evidence that LLMs encode linear spatial world models.
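As a toy illustration of the linear-probing idea in the entry above, the sketch below fits a least-squares probe from synthetic stand-in embeddings to 2-D coordinates; the dimensions and data are invented for the example, not taken from the paper:

```python
import numpy as np

# Toy illustration of linear probing for a spatial world model: if positions
# are linearly decodable from embeddings, a least-squares probe recovers them.
# Embeddings here are synthetic stand-ins, not real LLM activations.
rng = np.random.default_rng(0)
d_model, n_points = 64, 200

true_positions = rng.uniform(0, 10, size=(n_points, 2))  # (x, y) per item
W_true = rng.normal(size=(2, d_model))                   # hidden linear structure
embeddings = true_positions @ W_true + 0.01 * rng.normal(size=(n_points, d_model))

# Fit a linear probe: positions ~= embeddings @ W_probe
W_probe, *_ = np.linalg.lstsq(embeddings, true_positions, rcond=None)
recovered = embeddings @ W_probe
print(np.abs(recovered - true_positions).mean())  # near zero => linearly decodable
```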
arXiv Detail & Related papers (2025-06-03T15:31:00Z) - Can LLMs Learn to Map the World from Local Descriptions? [50.490593949836146]
This study investigates whether Large Language Models (LLMs) can construct coherent global spatial cognition. Experiments conducted in a simulated urban environment demonstrate that LLMs exhibit latent representations aligned with real-world spatial distributions.
arXiv Detail & Related papers (2025-05-27T08:22:58Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? [66.88619941063048]
We ask: Are multimodal large language models (MLLMs) ready for omnidirectional spatial reasoning? OSR-Bench is the first benchmark specifically designed for this setting. It includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings.
arXiv Detail & Related papers (2025-05-17T08:48:40Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability [58.46310813774538]
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - Does Spatial Cognition Emerge in Frontier Models? [56.47912101304053]
We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals.
arXiv Detail & Related papers (2024-10-09T01:41:49Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs). It features a controllable and diverse set of embodied tasks varying in difficulty and complexity. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning [9.461626534488117]
Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks. We propose a new benchmark, termed $\textbf{P}$ath $\textbf{P}$lanning from $\textbf{N}$atural $\textbf{L}$anguage.
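For a concrete sense of the underlying task such path-planning benchmarks pose, here is a minimal breadth-first-search planner on a toy obstacle grid; the grid encoding and function are illustrative assumptions, not the benchmark's interface:

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS shortest path on a 2-D grid where '#' cells are obstacles.

    Toy stand-in for the path-planning queries such benchmarks pose to LLMs;
    the grid encoding is an illustrative assumption, not the paper's format.
    """
    rows, cols = len(grid), len(grid[0])
    queue, parent = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:  # reconstruct the path by walking parents back to start
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable

grid = ["....",
        ".##.",
        "...."]
print(shortest_path(grid, (0, 0), (2, 3)))
```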
arXiv Detail & Related papers (2023-10-05T01:42:16Z) - Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
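As a toy illustration of the signed-distance-field representation in the last entry above: an SDF is negative inside an object, zero on its surface, and positive outside, which reduces clearance checks in planning to a single function evaluation. The sphere and threshold below are invented for the example:

```python
import math

def sphere_sdf(point, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface,
    positive outside. This sign convention is what makes SDFs convenient
    objectives for manipulation planning (e.g., clearance constraints)."""
    return math.dist(point, center) - radius

# Clearance check: keep the gripper at least 0.05 m from the object surface.
gripper = (0.0, 0.0, 1.2)
obstacle_center, obstacle_radius = (0.0, 0.0, 1.0), 0.1
print(sphere_sdf(gripper, obstacle_center, obstacle_radius) >= 0.05)  # True
```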
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.