SpatialTree: How Spatial Abilities Branch Out in MLLMs
- URL: http://arxiv.org/abs/2512.20617v1
- Date: Tue, 23 Dec 2025 18:59:46 GMT
- Title: SpatialTree: How Spatial Abilities Branch Out in MLLMs
- Authors: Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
- Abstract summary: We introduce a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels.
- Score: 109.32
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
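The correlation structure the abstract describes (near-orthogonal L1 skills versus strongly correlated higher-level skills) can be sketched with a simple pairwise correlation of per-model sub-ability scores. A minimal illustration follows; the score matrix and the ability names are hypothetical placeholders, not data from the paper, which evaluates 27 sub-abilities:

```python
import numpy as np

# Hypothetical accuracy scores: rows = evaluated models,
# columns = sub-abilities (placeholder names, not the paper's taxonomy labels).
scores = np.array([
    [0.72, 0.31, 0.55, 0.48],   # model A
    [0.40, 0.66, 0.52, 0.45],   # model B
    [0.58, 0.49, 0.71, 0.69],   # model C
    [0.35, 0.52, 0.60, 0.57],   # model D
])
abilities = ["L1:depth", "L1:orientation", "L3:simulation", "L4:agentic"]

# Pearson correlation between sub-ability score vectors across models:
# near-zero off-diagonal entries suggest orthogonal skills, while large
# positive entries suggest interdependent skills.
corr = np.corrcoef(scores, rowvar=False)

for i in range(len(abilities)):
    for j in range(i + 1, len(abilities)):
        print(f"{abilities[i]} vs {abilities[j]}: r = {corr[i, j]:+.2f}")
```

On real benchmark results, a block of high correlations among L2-L4 sub-abilities alongside weak L1-L1 correlations would reproduce the hierarchy the paper reports.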
Related papers
- UniCog: Uncovering Cognitive Abilities of LLMs through Latent Mind Space Analysis [69.50752734049985]
A growing body of research suggests that the cognitive processes of large language models (LLMs) differ fundamentally from those of humans. We propose UniCog, a unified framework that analyzes LLM cognition via a latent mind space.
arXiv Detail & Related papers (2026-01-25T16:19:00Z) - SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition [19.526371771173064]
Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to interact with the physical environment. Existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric. We propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels.
arXiv Detail & Related papers (2025-11-26T15:04:18Z) - Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning [56.496001894673235]
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs). Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling", and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy.
arXiv Detail & Related papers (2025-09-03T18:52:49Z) - Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory [2.8952499264943445]
Large language models (LLMs) leverage both knowledge and reasoning during inference. We propose a cognition attribution framework to decouple the contribution of knowledge and reasoning.
arXiv Detail & Related papers (2025-07-24T08:24:52Z) - OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models [17.976302783133956]
We introduce OmniSpatial, a benchmark for spatial reasoning grounded in cognitive psychology. It covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking. Through careful manual annotation, we construct over 8.4K question-answer pairs.
arXiv Detail & Related papers (2025-06-03T17:58:29Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [106.17986469245302]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test [5.346677002840565]
This study assesses the cognitive flexibility of state-of-the-art Visual Large Language Models (VLLMs). Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs.
arXiv Detail & Related papers (2025-05-28T08:40:55Z) - Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance [81.05882480184587]
We propose a computational framework of Kolb's learning cycle with Vygotsky's ZPD for autonomous agents. With 9 gold, 8 silver, and 12 bronze medal-level performances, including 4 gold and 4 silver in prize-awarding competitions, Agent K is the first AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning.
arXiv Detail & Related papers (2024-11-05T23:55:23Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z)