Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
- URL: http://arxiv.org/abs/2601.16378v1
- Date: Fri, 23 Jan 2026 00:21:27 GMT
- Title: Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
- Authors: Bridget Leonard, Scott O. Murray
- Abstract summary: Multimodal language models (MLMs) fail at spatial reasoning that requires adopting another agent's visual perspective. Inspired by human spatial cognition, we introduce perspective tokens that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Across synthetic and naturalistic benchmarks, perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields improved performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.
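The abstract does not spell out an implementation, but the mechanism it names, learned orientation embeddings spliced into the model's token sequence, can be sketched. In the minimal sketch below, the class name PerspectiveTokens, the discretized-yaw lookup, and the token count are all illustrative assumptions; only the 5120-dimensional hidden size corresponds to LLaVA-1.5-13B.

```python
import torch
import torch.nn as nn

class PerspectiveTokens(nn.Module):
    """Learned orientation embeddings spliced into a multimodal input
    sequence. Names, shapes, and the discretized-yaw scheme here are
    illustrative assumptions, not the paper's released code."""

    def __init__(self, hidden_dim: int = 5120, num_tokens: int = 4, num_bins: int = 8):
        super().__init__()
        # One small bank of tokens per orientation bin (8 bins of 45 degrees).
        self.banks = nn.Parameter(torch.randn(num_bins, num_tokens, hidden_dim) * 0.02)
        self.num_bins = num_bins

    def forward(self, agent_yaw_rad: torch.Tensor) -> torch.Tensor:
        # Map a continuous yaw angle to a bin index, then look up its tokens.
        frac = (agent_yaw_rad % (2 * torch.pi)) / (2 * torch.pi)
        bin_idx = (frac * self.num_bins).long().clamp(max=self.num_bins - 1)
        return self.banks[bin_idx]  # (batch, num_tokens, hidden_dim)

# Usage: splice the tokens between vision and text embeddings before the LM.
persp_module = PerspectiveTokens()
yaw = torch.tensor([1.57])              # reference agent facing ~90 degrees
persp = persp_module(yaw)               # (1, 4, 5120)
vision = torch.randn(1, 576, 5120)      # stand-in for projected CLIP patches
text = torch.randn(1, 32, 5120)         # stand-in for text token embeddings
inputs_embeds = torch.cat([vision, persp, text], dim=1)
print(inputs_embeds.shape)              # torch.Size([1, 612, 5120])
```

In practice the token banks would be trained jointly with, or by fine-tuning, the host model, consistent with the paper's finding that fine-tuning enhances orientation sensitivity already latent in the base model.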
Related papers
- Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models [5.961445903498366]
We introduce Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic forms that VLMs handle well. Experiments demonstrate that this reformulation substantially improves performance on both allocentric and egocentric tasks.
arXiv Detail & Related papers (2026-02-22T10:18:54Z)
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z)
- The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs [44.71703930770065]
We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as face matching and text-in-vision comprehension. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations.
arXiv Detail & Related papers (2025-12-17T20:22:23Z)
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
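LTD-Bench's own scoring pipeline is not described in this summary; as a toy illustration of how a dot-matrix drawing could be compared against a reference, the hypothetical dot_matrix_iou below treats '#' cells as ink and computes intersection-over-union. The metric and the function name are assumptions, not the benchmark's actual method.

```python
def dot_matrix_iou(pred: list[str], ref: list[str]) -> float:
    """Score a dot-matrix drawing by intersection-over-union of '#' cells
    against a reference grid. Purely illustrative, not LTD-Bench's scorer."""
    inter = union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p_cell, r_cell in zip(pred_row, ref_row):
            p, r = p_cell == "#", r_cell == "#"
            inter += p and r  # both grids inked this cell
            union += p or r   # either grid inked this cell
    return inter / union if union else 0.0

pred = ["..#..", ".###.", "#####"]  # a model-drawn triangle
ref  = ["..#..", ".###.", "#####"]
print(dot_matrix_iou(pred, ref))    # 1.0 for an exact match
```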
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
- From Bias to Balance: Exploring and Mitigating Spatial Bias in LVLMs [57.01486941224062]
Large Vision-Language Models (LVLMs) have achieved remarkable success across a wide range of multimodal tasks. We focus on how models respond when identical key visual information is placed at different locations within an image. We introduce Balanced Position Assignment (BaPA), a simple yet effective mechanism that assigns identical position embeddings to all image tokens.
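A minimal sketch of the BaPA idea as summarized above, assuming a flat layout where image tokens precede text tokens: every image token receives the same position id, so no image location is privileged, and text tokens resume ordinary sequential ids. The exact indexing scheme is an assumption, not the paper's formulation.

```python
import torch

def balanced_position_ids(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Assign one shared position id to all image tokens, then sequential
    ids to text tokens. Illustrative only; see the paper for the real scheme."""
    image_ids = torch.zeros(num_image_tokens, dtype=torch.long)  # all identical
    text_ids = torch.arange(1, num_text_tokens + 1)              # ordinary order
    return torch.cat([image_ids, text_ids])

print(balanced_position_ids(4, 3))  # tensor([0, 0, 0, 0, 1, 2, 3])
```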
arXiv Detail & Related papers (2025-09-26T07:07:03Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for evaluating multi-viewpoint spatial localization.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [50.33343842822694]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z)
- Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
- Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [13.768090541138571]
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text tokens. Attention analyses further uncover that vision tokens and system prompts dominate attention.
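The norm imbalance described above is straightforward to probe. The sketch below compares mean L2 norms of vision versus text token embeddings; the function name and the toy scaling are illustrative assumptions, and the paper's exact analysis may differ.

```python
import torch

def modality_norm_gap(vision_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
    """Compare mean L2 norms of vision vs. text token embeddings. A ratio
    well above 1.0 indicates vision tokens dominate in magnitude."""
    vision_norm = vision_embeds.norm(dim=-1).mean()
    text_norm = text_embeds.norm(dim=-1).mean()
    return (vision_norm / text_norm).item()

# Toy check with random embeddings scaled to mimic such an imbalance.
vision = torch.randn(576, 4096) * 5.0   # stand-in vision token embeddings
text = torch.randn(32, 4096)            # stand-in text token embeddings
print(f"vision/text norm ratio: {modality_norm_gap(vision, text):.2f}")  # ~5.00
```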
arXiv Detail & Related papers (2025-03-21T17:51:14Z)
- SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z)