Related papers: Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models

URL: http://arxiv.org/abs/2505.03821v1
Date: Sat, 03 May 2025 00:10:41 GMT
Title: Beyond Recognition: Evaluating Visual Perspective Taking in Vision Language Models
Authors: Gracjan Góral, Alicja Ziarko, Piotr Miłoś, Michał Nauman, Maciej Wołczyk, Michał Kosiński,
Abstract summary: We investigate the ability of Vision Language Models to perform visual perspective taking using a novel set of visual tasks inspired by established human tests.<n>Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object.<n>Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We investigate the ability of Vision Language Models (VLMs) to perform visual perspective taking using a novel set of visual tasks inspired by established human tests. Our approach leverages carefully controlled scenes, in which a single humanoid minifigure is paired with a single object. By systematically varying spatial configurations - such as object position relative to the humanoid minifigure and the humanoid minifigure's orientation - and using both bird's-eye and surface-level views, we created 144 unique visual tasks. Each visual task is paired with a series of 7 diagnostic questions designed to assess three levels of visual cognition: scene understanding, spatial reasoning, and visual perspective taking. Our evaluation of several state-of-the-art models, including GPT-4-Turbo, GPT-4o, Llama-3.2-11B-Vision-Instruct, and variants of Claude Sonnet, reveals that while they excel in scene understanding, the performance declines significantly on spatial reasoning and further deteriorates on perspective-taking. Our analysis suggests a gap between surface-level object recognition and the deeper spatial and perspective reasoning required for complex visual tasks, pointing to the need for integrating explicit geometric representations and tailored training protocols in future VLM development.

Related papers

Temporal Slowness in Central Vision Drives Semantic Object Learning [10.92192887447196]
Humans acquire semantic object representations from egocentric visual streams with minimal supervision.<n>This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience.
arXiv Detail & Related papers (2026-02-04T11:47:58Z)
VisRes Bench: On Evaluating the Visual Reasoning Capabilities of VLMs [7.406217790017003]
We introduce VisRes Bench, a benchmark to study visual reasoning in naturalistic settings without contextual language supervision.<n>Analyzing model behavior across three levels of complexity, we uncover clear limitations in perceptual and relational visual reasoning capacities.<n>We conclude by discussing how VisRes provides a unified framework for advancing abstract visual reasoning in multimodal research.
arXiv Detail & Related papers (2025-12-24T14:18:38Z)
ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.<n>Models trained with the ViCrit Task exhibit substantial gains across a variety of vision-language models benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content.<n>Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints.<n>We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness [34.49001130529016]
We introduce MMPerspective, the first benchmark specifically designed to evaluate multimodal large language models' understanding of perspective.<n>Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities.<n>Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations.
arXiv Detail & Related papers (2025-05-26T18:20:22Z)
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks. We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability. We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks. Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding [41.59673370285659]
We present a comprehensive study that probes various visual encoding models for 3D scene understanding. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, geometric diffusion models benefit tasks, and language-pretrained models show unexpected limitations in language-related tasks.
arXiv Detail & Related papers (2024-09-05T17:59:56Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios. The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors. We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
Perceive, Ground, Reason, and Act: A Benchmark for General-purpose Visual Representation [26.039045505150526]
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. We present a new comprehensive benchmark, General Visual Understanding Evaluation, covering the full spectrum of visual cognitive abilities.
arXiv Detail & Related papers (2022-11-28T15:06:07Z)
Neural Groundplans: Persistent Neural Scene Representations from a Single Image [90.04272671464238]
We present a method to map 2D image observations of a scene to a persistent 3D scene representation. We propose conditional neural groundplans as persistent and memory-efficient scene representations.
arXiv Detail & Related papers (2022-07-22T17:41:24Z)
Peripheral Vision Transformer [52.55309200601883]
We take a biologically inspired approach and explore to model peripheral vision in deep neural networks for visual recognition. We propose to incorporate peripheral position encoding to the multi-head self-attention layers to let the network learn to partition the visual field into diverse peripheral regions given training data. We evaluate the proposed network, dubbed PerViT, on the large-scale ImageNet dataset and systematically investigate the inner workings of the model for machine perception.
arXiv Detail & Related papers (2022-06-14T12:47:47Z)
3D Neural Scene Representations for Visuomotor Control [78.79583457239836]
We learn models for dynamic 3D scenes purely from 2D visual observations. A dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks.
arXiv Detail & Related papers (2021-07-08T17:49:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.