SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
- URL: http://arxiv.org/abs/2509.25390v1
- Date: Mon, 29 Sep 2025 18:48:16 GMT
- Title: SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs
- Authors: Yuyou Zhang, Radu Corcodel, Chiori Hori, Anoop Cherian, Ding Zhao,
- Abstract summary: We present SpinBench, a diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). Since perspective taking requires multiple cognitive capabilities, SpinBench introduces a set of fine-grained diagnostic categories. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations.
- Score: 49.106901743548036
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. We believe SpinBench provides critical insights into spatial reasoning in VLMs and highlights key gaps in their ability to reason about physical space. Our website can be found at https://spinbench25.github.io/.
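The abstract describes per-category accuracy over progressively harder diagnostic tasks. As a rough illustration only, the sketch below shows how such a multiple-choice evaluation might be tallied per diagnostic category; the item schema, category names, and the `query_vlm` helper are assumptions for illustration, not the benchmark's actual interface.

```python
# Minimal sketch (not the authors' code): tallying per-category accuracy for a
# SpinBench-style multiple-choice evaluation. Item fields, category names, and
# query_vlm() are hypothetical placeholders.
from collections import defaultdict

def query_vlm(image_path: str, question: str, choices: list[str]) -> str:
    """Placeholder for a call to a vision-language model; returns one of the choices."""
    raise NotImplementedError

def evaluate(items: list[dict]) -> dict[str, float]:
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        # Each item pairs an image with a spatial question, e.g.
        # "From the red robot's viewpoint, is the mug left or right of the box?"
        pred = query_vlm(item["image"], item["question"], item["choices"])
        cat = item["category"]  # e.g. "translation", "rotation", "viewpoint_change"
        total[cat] += 1
        correct[cat] += int(pred == item["answer"])
    # Per-category accuracy, so weaknesses (e.g. rotation) are visible separately.
    return {cat: correct[cat] / total[cat] for cat in total}
```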
Related papers
- Learning Situated Awareness in the Real World [63.75211123289058]
SAW-Bench is a novel benchmark for evaluating egocentric situated awareness using real-world videos. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.
arXiv Detail & Related papers (2026-02-18T18:22:52Z) - Egocentric Bias in Vision-Language Models [11.385014698426088]
We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
arXiv Detail & Related papers (2026-02-10T03:51:00Z) - Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation [41.434638833165494]
Allocentric Perceiver is a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts. Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation.
arXiv Detail & Related papers (2026-02-05T15:45:39Z) - Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective [17.592210658831902]
Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. Current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints.
arXiv Detail & Related papers (2025-12-02T02:21:29Z) - VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal awareness of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [34.49001130529016]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - The Right Spin: Learning Object Motion from Rotation-Compensated Flow Fields [61.664963331203666]
How humans perceive moving objects is a longstanding research question in computer vision.
One approach to the problem is to teach a deep network to model all of these effects.
We present a novel probabilistic model to estimate the camera's rotation given the motion field.
arXiv Detail & Related papers (2022-02-28T22:05:09Z) - Weakly Supervised Relative Spatial Reasoning for Visual Question Answering [38.05223339919346]
We evaluate the faithfulness of V&L models to such geometric understanding.
We train V&L with weak supervision from off-the-shelf depth estimators.
This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge.
arXiv Detail & Related papers (2021-09-04T21:29:06Z)