CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
- URL: http://arxiv.org/abs/2512.23328v3
- Date: Thu, 01 Jan 2026 15:48:39 GMT
- Title: CubeBench: Diagnosing Interactive, Long-Horizon Spatial Reasoning Under Partial Observations
- Authors: Huan-ang Gao, Zikang Zhang, Tianwei Luo, Kaisen Yang, Xinzhe Juan, Jiahao Qiu, Tianxing Chen, Bingxiang He, Hao Zhao, Hao Zhou, Shilong Liu, Mengdi Wang
- Abstract summary: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. We introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube.
- Score: 60.51118188315758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) agents, while proficient in the digital realm, face a significant gap in physical-world deployment due to the challenge of forming and maintaining a robust spatial mental model. We identify three core cognitive challenges hindering this transition: spatial reasoning, long-horizon state tracking via mental simulation, and active exploration under partial observation. To isolate and evaluate these faculties, we introduce CubeBench, a novel generative benchmark centered on the Rubik's Cube. CubeBench uses a three-tiered diagnostic framework that progressively assesses agent capabilities, from foundational state tracking with full symbolic information to active exploration with only partial visual data. Our experiments on leading LLMs reveal critical limitations, including a uniform 0.00% pass rate on all long-horizon tasks, exposing a fundamental failure in long-term planning. We also propose a diagnostic framework to isolate these cognitive bottlenecks by providing external solver tools. By analyzing the failure modes, we provide key insights to guide the development of more physically-grounded intelligent agents.
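The benchmark's first tier, foundational state tracking with full symbolic information, can be illustrated with a minimal sketch. The representation below (faces as 9-sticker lists, a single clockwise U turn) is a hypothetical illustration of symbolic cube-state tracking, not CubeBench's actual implementation, which the abstract does not specify:

```python
# Minimal symbolic cube-state tracker -- an illustrative sketch only,
# not CubeBench's actual representation.
import copy

# Solved state: each face (Up, Down, Front, Back, Left, Right)
# holds 9 stickers of its own color, stored row-major.
SOLVED = {f: [f] * 9 for f in "UDFBLR"}

def rotate_face_cw(face):
    """Rotate a 3x3 face a quarter turn clockwise (row-major indices)."""
    return [face[6], face[3], face[0],
            face[7], face[4], face[1],
            face[8], face[5], face[2]]

def move_U(state):
    """Apply a clockwise U turn: rotate the U face and cycle the
    top rows of the side faces F -> L -> B -> R -> F."""
    s = copy.deepcopy(state)
    s["U"] = rotate_face_cw(s["U"])
    f, r, b, l = state["F"][:3], state["R"][:3], state["B"][:3], state["L"][:3]
    s["L"][:3], s["B"][:3], s["R"][:3], s["F"][:3] = f, l, b, r
    return s

# Long-horizon state tracking: apply a move sequence mentally (here, in
# code) and confirm the state is recoverable. U has order 4, so four
# applications must return the cube to the solved state.
state = SOLVED
for _ in range(4):
    state = move_U(state)
assert state == SOLVED
```

An agent evaluated on this tier would have to maintain exactly this kind of state update internally, without executing code, which is where the paper reports long-horizon failures.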
Related papers
- SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery [64.67498968405327]
SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with consistency geometric constraints.
arXiv Detail & Related papers (2025-12-08T17:20:50Z) - Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective [17.592210658831902]
Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. Current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints.
arXiv Detail & Related papers (2025-12-02T02:21:29Z) - REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories [19.741468026765062]
We introduce REM (Reasoning over Embodied Multi-Frame Trajectories), a benchmark using controllable 3D environments for embodied spatial reasoning. REM systematically evaluates key aspects like object permanence/distinction, spatial relationships, and numerical tracking across dynamic embodied viewpoints. Our evaluation shows that the best-performing current models exhibit promising overall performance, but become increasingly unreliable at even moderate complexity levels easily handled by humans.
arXiv Detail & Related papers (2025-11-30T05:20:22Z) - LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - SpatialViz-Bench: An MLLM Benchmark for Spatial Visualization [44.427830927596204]
SpatialViz-Bench is a comprehensive benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs reveals wide performance variations and uncovers counter-intuitive findings.
arXiv Detail & Related papers (2025-07-10T10:27:20Z) - SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks [51.774165536666864]
We introduce SIRI-Bench, a benchmark designed to evaluate Vision-Language Models' structural spatial intelligence. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning.
arXiv Detail & Related papers (2025-06-17T13:40:00Z) - PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly [77.33429729761596]
We introduce PhyBlock, a progressive benchmark to assess vision-language models (VLMs) on physical understanding and planning. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning.
arXiv Detail & Related papers (2025-06-10T11:46:06Z) - Reflection-Bench: Evaluating Epistemic Agency in Large Language Models [10.801745760525838]
Epistemic agency is the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments. We propose Reflection-Bench, a benchmark consisting of seven tasks with long-term relevance and minimization of data leakage. Our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms.
arXiv Detail & Related papers (2024-10-21T17:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.