Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds
- URL: http://arxiv.org/abs/2602.07864v1
- Date: Sun, 08 Feb 2026 08:29:38 GMT
- Title: Thinking in Structures: Evaluating Spatial Intelligence through Reasoning on Constrained Manifolds
- Authors: Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan,
- Abstract summary: SSI-Bench is a benchmark for spatial reasoning on constrained 3D structures. Ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. The best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%.
- Score: 6.062002698657217
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial intelligence is crucial for vision-language models (VLMs) in the physical world, yet many benchmarks evaluate largely unconstrained scenes where models can exploit 2D shortcuts. We introduce SSI-Bench, a VQA benchmark for spatial reasoning on constrained manifolds, built from complex real-world 3D structures whose feasible configurations are tightly governed by geometric, topological, and physical constraints. SSI-Bench contains 1,000 ranking questions spanning geometric and topological reasoning and requiring a diverse repertoire of compositional spatial operations, such as mental rotation, cross-sectional inference, occlusion reasoning, and force-path reasoning. It is created via a fully human-centered pipeline: ten researchers spent over 400 hours curating images, annotating structural components, and designing questions to minimize pixel-level cues. Evaluating 31 widely used VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Encouraging models to think yields only marginal gains, and error analysis points to failures in structural grounding and constraint-consistent 3D reasoning. Project page: https://ssi-bench.github.io.
Related papers
- SSR: Pushing the Limit of Spatial Intelligence with Structured Scene Reasoning [30.87517633729756]
SSR is a framework designed for Structured Scene Reasoning. It seamlessly integrates 2D and 3D representations via a lightweight alignment mechanism. It achieves state-of-the-art performance on multiple spatial intelligence benchmarks.
arXiv Detail & Related papers (2026-02-28T02:05:35Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z) - PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly [77.33429729761596]
We introduce PhyBlock, a progressive benchmark to assess vision-language models (VLMs) on physical understanding and planning. PhyBlock integrates a novel four-level cognitive-hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning.
arXiv Detail & Related papers (2025-06-10T11:46:06Z) - E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z) - MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence [74.51213082084428]
MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs. The strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%.
arXiv Detail & Related papers (2025-05-29T17:59:52Z) - Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models [12.945689517235264]
We introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate vision-language models' spatial perception, structural understanding, and reasoning capabilities. The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task.
arXiv Detail & Related papers (2025-05-27T05:17:41Z) - Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models [8.499125564147834]
We present a scalable and unbiased synthetic dataset designed around four key capabilities for spatial reasoning. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels. We observe a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks.
arXiv Detail & Related papers (2025-02-12T18:53:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.