11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
- URL: http://arxiv.org/abs/2508.20068v1
- Date: Wed, 27 Aug 2025 17:22:34 GMT
- Title: 11Plus-Bench: Demystifying Multimodal LLM Spatial Reasoning with Cognitive-Inspired Analysis
- Authors: Chengzu Li, Wenshan Wu, Huanyu Zhang, Qingtao Li, Zeyu Gao, Yan Xia, José Hernández-Orallo, Ivan Vulić, Furu Wei
- Abstract summary: This work introduces a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs. Through experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning.
- Score: 54.24689751375923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In human cognition, spatial reasoning and perception are closely entangled, yet the nature of this interplay remains underexplored in the evaluation of multimodal large language models (MLLMs). While recent MLLM advancements show impressive performance on reasoning, their capacity for human-like spatial cognition remains an open question. In this work, we introduce a systematic evaluation framework to assess the spatial reasoning abilities of state-of-the-art MLLMs relative to human performance. Central to our work is 11Plus-Bench, a high-quality benchmark derived from realistic standardized spatial aptitude tests. 11Plus-Bench also features fine-grained expert annotations of both perceptual complexity and reasoning process, enabling detailed instance-level analysis of model behavior. Through extensive experiments across 14 MLLMs and human evaluation, we find that current MLLMs exhibit early signs of spatial cognition. Despite a large performance gap compared to humans, MLLMs' cognitive profiles resemble those of humans in that cognitive effort correlates strongly with reasoning-related complexity. However, instance-level performance in MLLMs remains largely random, whereas human correctness is highly predictable and shaped by abstract pattern complexity. These findings highlight both emerging capabilities and limitations in current MLLMs' spatial reasoning and provide actionable insights for advancing model design.
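To make the instance-level analysis concrete, below is a minimal sketch (not the authors' code) of the two analyses the abstract describes: correlating a proxy for cognitive effort with expert-annotated reasoning complexity, and testing how predictable correctness is from the complexity annotations. The use of reasoning-trace length as the effort proxy, the 1-5 rating scales, and all variable names are illustrative assumptions; real values would come from 11Plus-Bench annotations and model or human responses.

```python
# Hypothetical sketch of the instance-level analysis described in the abstract.
# Random placeholder data stands in for 11Plus-Bench annotations and responses.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200  # number of benchmark instances (placeholder)

reasoning_complexity = rng.integers(1, 6, size=n)           # expert rating, 1-5 (assumed scale)
perceptual_complexity = rng.integers(1, 6, size=n)          # expert rating, 1-5 (assumed scale)
effort = reasoning_complexity * 40 + rng.normal(0, 30, n)   # e.g. reasoning-trace length as effort proxy
correct = rng.integers(0, 2, size=n)                        # 1 = instance answered correctly

# (1) Does cognitive effort track reasoning-related complexity?
rho, p = spearmanr(effort, reasoning_complexity)
print(f"effort vs. reasoning complexity: rho={rho:.2f}, p={p:.3g}")

# (2) How predictable is correctness from the complexity annotations?
# Near-chance cross-validated AUC would indicate largely random instance-level performance.
X = np.column_stack([reasoning_complexity, perceptual_complexity])
auc = cross_val_score(LogisticRegression(), X, correct, cv=5, scoring="roc_auc")
print(f"correctness predictability (mean AUC): {auc.mean():.2f}")
```

On real data, the pattern the abstract reports would correspond to a high Spearman correlation together with near-chance AUC for MLLM correctness, versus a clearly above-chance AUC when the same predictors are fit to human correctness.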
Related papers
- UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis [69.50752734049985]
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans. We propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space.
arXiv Detail & Related papers (2026-01-25T16:19:00Z) - SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery [64.67498968405327]
SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.
arXiv Detail & Related papers (2025-12-08T17:20:50Z) - SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition [19.526371771173064]
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to interact with the physical environment. Existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric. We propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels.
arXiv Detail & Related papers (2025-11-26T15:04:18Z) - MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity [28.797461492275488]
MME-CC is a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. We identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions.
arXiv Detail & Related papers (2025-11-05T03:09:16Z) - Large Language Models Show Signs of Alignment with Human Neurocognition During Abstract Reasoning [0.0]
This study investigates whether large language models (LLMs) mirror human neurocognition during abstract reasoning. We compared the performance and neural representations of human participants with those of eight open-source LLMs on an abstract-pattern-completion task.
arXiv Detail & Related papers (2025-08-12T21:38:46Z) - SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs [43.82781630267406]
SpatialViz-Bench is a comprehensive benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs reveals wide performance variations and uncovers counter-intuitive findings.
arXiv Detail & Related papers (2025-07-10T10:27:20Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes [84.1059652774853]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - Human-like object concept representations emerge naturally in multimodal large language models [24.003766123531545]
We combined behavioral and neuroimaging analyses to explore the relationship between object concept representations in Large Language Models (LLMs) and human cognition. Our findings advance the understanding of machine intelligence and inform the development of more human-like artificial cognitive systems.
arXiv Detail & Related papers (2024-07-01T08:17:19Z) - Unveiling Theory of Mind in Large Language Models: A Parallel to Single
Neurons in the Human Brain [2.5350521110810056]
Large language models (LLMs) have been found to exhibit a certain level of Theory of Mind (ToM). The precise processes underlying LLMs' capacity for ToM, and their similarity to those of humans, remain largely unknown.
arXiv Detail & Related papers (2023-09-04T15:26:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the information and is not responsible for any consequences of its use.