Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
- URL: http://arxiv.org/abs/2602.05789v1
- Date: Thu, 05 Feb 2026 15:45:39 GMT
- Title: Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
- Authors: Hengyi Wang, Ruiqiang Zhang, Chang Liu, Guanjie Wang, Zehua Ma, Han Fang, Weiming Zhang,
- Abstract summary: Allocentric Perceiver is a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts. Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation.
- Score: 41.434638833165494
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: With the rising need for spatially grounded tasks such as Vision-Language Navigation/Action, allocentric perception capabilities in Vision-Language Models (VLMs) are receiving growing focus. However, VLMs remain brittle on allocentric spatial queries that require explicit perspective shifts, where the answer depends on reasoning in a target-centric frame rather than the observed camera view. Thus, we introduce Allocentric Perceiver, a training-free strategy that recovers metric 3D states from one or more images with off-the-shelf geometric experts, and then instantiates a query-conditioned allocentric reference frame aligned with the instruction's semantic intent. By deterministically transforming reconstructed geometry into the target frame and prompting the backbone VLM with structured, geometry-grounded representations, Allocentric Perceiver offloads mental rotation from implicit reasoning to explicit computation. We evaluate Allocentric Perceiver across multiple backbone families on spatial reasoning benchmarks, observing consistent and substantial gains ($\sim$10%) on allocentric tasks while maintaining strong egocentric performance, and surpassing both spatial-perception-finetuned models and state-of-the-art open-source and proprietary models.
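The abstract describes the frame-instantiation step in prose only. As a rough illustration of the idea, the NumPy sketch below re-expresses reconstructed metric points in a target-centric frame so that a left/right query becomes an explicit sign check rather than an implicit mental rotation. All function names, coordinate conventions, and the toy scene are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def to_allocentric(points_cam, target_pos, target_forward, up=np.array([0.0, 0.0, 1.0])):
    """Re-express camera-frame metric points in a frame centred on a target object,
    with the x-axis along the target's facing direction."""
    fwd = target_forward / np.linalg.norm(target_forward)
    right = np.cross(fwd, up)
    right = right / np.linalg.norm(right)
    up_axis = np.cross(right, fwd)
    R = np.stack([fwd, right, up_axis])      # rows are the allocentric axes (in camera coords)
    return (points_cam - target_pos) @ R.T   # translate to the target, then rotate

# Toy allocentric query: "Is the mug to the chair's left?"
chair_pos = np.array([1.0, 0.0, 0.0])        # chair position in the camera frame
chair_fwd = np.array([0.0, 1.0, 0.0])        # chair facing direction in the camera frame
mug = np.array([[0.2, 0.5, 0.0]])            # reconstructed mug position (camera frame)

mug_allo = to_allocentric(mug, chair_pos, chair_fwd)
# A negative coordinate along the chair's "right" axis means the mug is on its left.
print("left of the chair" if mug_allo[0, 1] < 0 else "right of the chair")
```

A structured, geometry-grounded prompt for the backbone VLM could then be assembled from such target-frame coordinates, leaving only the final verbalization to the model.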
Related papers
- Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models [5.961445903498366]
We introduce Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic forms that VLMs handle well. Experiments demonstrate that this reformulation substantially improves performance on both allocentric and egocentric tasks.
arXiv Detail & Related papers (2026-02-22T10:18:54Z) - Thinking with Geometry: Active Geometry Integration for Spatial Reasoning [68.59084007360615]
We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
arXiv Detail & Related papers (2026-02-05T18:59:32Z) - RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z) - Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models [0.0]
Multimodal language models (MLMs) fail at spatial reasoning that requires adopting another agent's visual perspective. Inspired by human spatial cognition, we introduce perspective tokens that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Across synthetic and naturalistic benchmarks, perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents.
arXiv Detail & Related papers (2026-01-23T00:21:27Z) - Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z) - EagleVision: A Dual-Stage Framework with BEV-grounding-based Chain-of-Thought for Spatial Intelligence [10.889641815961133]
Spatial intelligence approaches typically attach 3D cues to 2D reasoning pipelines or MLLMs with black-box reconstruction modules. We present EagleVision, a framework for progressive spatial cognition through macro perception and micro verification.
arXiv Detail & Related papers (2025-12-17T07:51:36Z) - CVP: Central-Peripheral Vision-Inspired Multimodal Model for Spatial Reasoning [48.36177110428022]
We present a central-peripheral vision-inspired framework (CVP) for spatial reasoning. CVP draws inspiration from the two types of human visual fields: central vision and peripheral vision. Experiments show that CVP achieves state-of-the-art performance across a range of 3D scene understanding benchmarks.
arXiv Detail & Related papers (2025-12-09T00:21:13Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features, respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z)