Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
- URL: http://arxiv.org/abs/2602.19117v2
- Date: Tue, 24 Feb 2026 10:19:29 GMT
- Title: Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
- Authors: Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang
- Abstract summary: We introduce Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic forms that VLMs handle well. Experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks.
- Score: 5.961445903498366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance on both allocentric and egocentric tasks, that it enhances robustness under visual illusions and in multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach to complex perspective-aware spatial reasoning.
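To make the reformulation concrete, the sketch below illustrates one possible reading of the four factors in plain Python: object positions are projected into a reference object's frame, abstracted to single-character symbols, bipartitioned into the reference object's left and right half-spaces, and localized on a small text grid that a VLM could consume as a symbolic layout. This is a minimal sketch under assumed conventions (the object names, coordinates, frame handling, and grid encoding are all illustrative), not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    name: str
    x: float              # camera-frame position (meters), +x to the camera's right
    z: float               # camera-frame depth (meters), +z away from the camera
    heading: float = 0.0   # yaw in radians; 0 = facing away from the camera (+z)


def project_to_frame(obj: SceneObject, ref: SceneObject) -> tuple:
    """Projection: express obj's position in ref's object-centered frame
    (positive x = ref's right, positive z = in front of ref)."""
    dx, dz = obj.x - ref.x, obj.z - ref.z
    cos_h, sin_h = math.cos(ref.heading), math.sin(ref.heading)
    return cos_h * dx + sin_h * dz, -sin_h * dx + cos_h * dz


def symbolic_layout(objects, ref_name, size=5):
    """Abstraction + bipartition + localization: render a coarse text grid
    centered on the reference object, split into its left and right halves."""
    ref = next(o for o in objects if o.name == ref_name)
    mid = size // 2
    grid = [["." for _ in range(size)] for _ in range(size)]
    grid[mid][mid] = ref.name[0].upper()        # reference object sits at the center
    for obj in objects:
        if obj.name == ref_name:
            continue
        lx, lz = project_to_frame(obj, ref)
        col = max(0, min(size - 1, mid + round(lx)))   # left/right of the reference
        row = max(0, min(size - 1, mid - round(lz)))   # in front of / behind it
        grid[row][col] = obj.name[0].lower()           # abstraction: one symbol per object
    header = f"layout (centered on {ref_name}; left half | right half):"
    body = [" ".join(r[:mid]) + " | " + " ".join(r[mid:]) for r in grid]
    return "\n".join([header, *body])


if __name__ == "__main__":
    scene = [
        SceneObject("chair", 0.0, 2.0, heading=math.pi),  # the chair faces the camera
        SceneObject("mug", 1.0, 2.0),
        SceneObject("lamp", -1.0, 3.0),
    ]
    # "Is the mug to the chair's left?" becomes a lookup in the symbolic layout:
    # the mug lands in the left half, even though it is to the *camera's* right.
    print(symbolic_layout(scene, "chair"))
```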
Related papers
- Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation [41.434638833165494]
Allocentric Perceiver is a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts. Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation.
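As a rough illustration of the "explicit computation" idea described above, the sketch below assumes a metric 3D estimate of the scene is already available (in the paper this would come from the off-the-shelf geometric experts) and answers a left/right question by an explicit change of reference frame; the function names, coordinate conventions, and numbers are assumptions for illustration, not the Allocentric Perceiver code.

```python
import numpy as np


def yaw_rotation(yaw: float) -> np.ndarray:
    """Rotation about the vertical y-axis (camera frame: +x right, +y up, +z forward)."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])


def relation_from(ref_pos, ref_yaw, target_pos) -> str:
    """Express target_pos in the reference agent's frame and read off left vs. right."""
    offset = np.asarray(target_pos, dtype=float) - np.asarray(ref_pos, dtype=float)
    local = yaw_rotation(ref_yaw).T @ offset   # world -> reference-agent coordinates
    return "left" if local[0] < 0 else "right"


# A cup one meter to the camera's right of a person who faces the camera
# (yaw = pi) is on the person's *left*; the rotation makes this explicit.
print(relation_from(ref_pos=[0.0, 0.0, 2.0], ref_yaw=np.pi, target_pos=[1.0, 0.0, 2.0]))
```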
arXiv Detail & Related papers (2026-02-05T15:45:39Z)
- Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models [0.0]
Multimodal language models (MLMs) fail at spatial reasoning that requires adopting another agent's visual perspective. Inspired by human spatial cognition, we introduce perspective tokens that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Across synthetic and naturalistic benchmarks, perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents.
arXiv Detail & Related papers (2026-01-23T00:21:27Z)
- Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z)
- SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery [64.67498968405327]
SpatialDreamer is a reinforcement learning framework that enables spatial reasoning through a closed-loop process of active exploration. GeoPO introduces tree-structured sampling and step-level reward estimation with geometric consistency constraints.
arXiv Detail & Related papers (2025-12-08T17:20:50Z)
- Reasoning Path and Latent State Analysis for Multi-view Visual Spatial Reasoning: A Cognitive Science Perspective [17.592210658831902]
Spatial reasoning is a core aspect of human intelligence that allows perception, inference and planning in 3D environments. Current vision-language models (VLMs) struggle to maintain geometric coherence and cross-view consistency for spatial reasoning in multi-view settings. We present ReMindView-Bench, a cognitively grounded benchmark for evaluating how VLMs construct, align and maintain spatial mental models across complementary viewpoints.
arXiv Detail & Related papers (2025-12-02T02:21:29Z)
- Artemis: Structured Visual Reasoning for Perception Policy Learning [64.57381337070616]
Empirical observations indicate that purely linguistic intermediate reasoning often reduces performance on perception tasks. We introduce Artemis, a perception-policy learning framework that performs structured proposal-based reasoning.
arXiv Detail & Related papers (2025-12-01T18:45:30Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features, respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [50.33343842822694]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z)
- SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z)