Egocentric Bias in Vision-Language Models
- URL: http://arxiv.org/abs/2602.15892v1
- Date: Tue, 10 Feb 2026 03:51:00 GMT
- Title: Egocentric Bias in Vision-Language Models
- Authors: Maijunxian Wang, Yijiang Li, Bingyang Wang, Tianwei Zhao, Ran Ji, Qingying Gao, Emmy Liu, Hokin Deng, Dezhi Luo
- Abstract summary: We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
- Score: 11.385014698426088
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
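The core task can be illustrated with a minimal sketch. The snippet below rotates a 2D character grid by 180 degrees, which is the spatial transformation the abstract describes; this is a simplified illustration written for this summary, not the benchmark's actual stimulus-generation code, and it flips character order only (real rendered glyphs would also appear upside down).

```python
def rotate_180(grid):
    """Rotate a 2D character grid by 180 degrees:
    reverse the row order, then reverse each row."""
    return [row[::-1] for row in reversed(grid)]

# A two-row character string as seen from the camera's viewpoint.
camera_view = [list("ABC"), list("DEF")]

# What an agent facing the camera from the opposite side would read.
flipped = rotate_180(camera_view)
print(["".join(row) for row in flipped])  # ['FED', 'CBA']
```

Answering with the unrotated `camera_view` rather than `flipped` corresponds to the egocentric error the paper reports as dominating model failures.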
Related papers
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- The Perceptual Observatory Characterizing Robustness and Grounding in MLLMs [44.71703930770065]
We present The Perceptual Observatory, a framework that characterizes MLLMs across verticals such as face matching and text-in-vision comprehension. The Perceptual Observatory moves beyond leaderboard accuracy to yield insights into how MLLMs preserve perceptual grounding and relational structure under perturbations.
arXiv Detail & Related papers (2025-12-17T20:22:23Z)
- Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving [48.512353531499286]
We introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D/3D scene understanding abilities within a single vision-language model (VLM). We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on 2D detection and nuScenes BEV 3D detection.
arXiv Detail & Related papers (2025-11-24T15:28:25Z)
- Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models [23.12717700882611]
Spatial reasoning is a fundamental component of human cognition. Current large language models (LLMs) and vision language models (VLMs) have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model.
arXiv Detail & Related papers (2025-11-16T03:09:55Z)
- MindJourney: Test-Time Scaling with World Models for Spatial Reasoning [97.61985090279961]
We propose MindJourney, a test-time scaling framework for vision-language models. We show that MindJourney achieves an average performance boost of over 7.7% on the representative spatial reasoning benchmark SAT. Our method also improves upon test-time inference VLMs trained through reinforcement learning.
arXiv Detail & Related papers (2025-07-16T17:59:36Z)
- Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation. Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z)
- ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
- MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [50.33343842822694]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z)
- Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views [9.476008200056082]
Ego3DPose is a highly accurate binocular egocentric 3D pose reconstruction system.
We propose a two-path network architecture with a path that estimates pose per limb independently with its binocular heatmaps.
We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs.
arXiv Detail & Related papers (2023-09-21T10:34:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.