SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
- URL: http://arxiv.org/abs/2511.21471v1
- Date: Wed, 26 Nov 2025 15:04:18 GMT
- Title: SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
- Authors: Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Yunjian Zhang,
- Abstract summary: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to interact with the physical environment. Existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric. We propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels.
- Score: 19.526371771173064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model's overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.
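The abstract mentions a "capability-oriented metric" that unifies evaluation across heterogeneous tasks grouped into cognitive levels, but does not state its formula. As a purely illustrative sketch (the task names, groupings, and level weights below are assumptions, not taken from the paper), one plausible aggregation averages per-task accuracy within each level and then takes a weighted mean across levels:

```python
# Hypothetical sketch: aggregate per-task accuracies into per-level scores,
# then into one capability score. The actual SpatialBench metric may differ.

# Per-task accuracies grouped by cognitive level (illustrative numbers).
results = {
    "observation":   {"object_counting": 0.92, "size_comparison": 0.88},
    "understanding": {"relative_position": 0.71},
    "reasoning":     {"route_planning": 0.44, "causal_inference": 0.39},
}

# Level weights emphasizing higher cognition (assumed, not from the paper).
level_weights = {"observation": 1.0, "understanding": 1.5, "reasoning": 2.0}

def capability_score(results, weights):
    """Weighted mean of per-level mean accuracies."""
    level_means = {
        level: sum(tasks.values()) / len(tasks)
        for level, tasks in results.items()
    }
    total_w = sum(weights[level] for level in level_means)
    return sum(weights[l] * m for l, m in level_means.items()) / total_w

print(round(capability_score(results, level_weights), 3))  # → 0.621
```

A hierarchical aggregate like this keeps a level with many easy tasks from dominating the score, which matches the abstract's observation that per-level stratification (strong perception, weak planning) is the signal of interest.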
Related papers
- SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery [64.67498968405327]
SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.
arXiv Detail & Related papers (2025-12-08T17:20:50Z)
- Spatial-DISE: A Unified Benchmark for Evaluating Spatial Reasoning in Vision-Language Models [21.28937516885804]
We propose a unified benchmark, Spatial-DISE, based on a cognitively grounded taxonomy that categorizes tasks into four fundamental quadrants. To address the issue of data scarcity, we develop a scalable and automated pipeline to generate diverse and verifiable spatial reasoning questions.
arXiv Detail & Related papers (2025-10-15T10:44:01Z)
- How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z)
- 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis [54.24689751375923]
This work introduces a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs. Through experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. These findings highlight both the emerging capabilities and the limitations of current MLLMs' spatial reasoning.
arXiv Detail & Related papers (2025-08-27T17:22:34Z)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes [84.1059652774853]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning. Our method enables zero-shot spatial reasoning without task-specific fine-tuning. Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z)
- Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning [36.588008658084895]
Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning. Our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems. We enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities.
arXiv Detail & Related papers (2024-10-21T16:26:09Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.