Probing Perceptual Constancy in Large Vision Language Models
- URL: http://arxiv.org/abs/2502.10273v1
- Date: Fri, 14 Feb 2025 16:31:43 GMT
- Title: Probing Perceptual Constancy in Large Vision Language Models
- Authors: Haoran Sun, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
- Abstract summary: We evaluated 33 Vision-Language Models (VLMs) using 253 experiments across three domains: color, size, and shape constancy.
We found significant variability in VLM performance, with models' performance on shape constancy clearly dissociated from that on color and size constancy.
- Abstract: Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for recognizing visual information in a dynamic world, making it essential for Vision-Language Models (VLMs). However, whether VLMs are currently and theoretically capable of mastering this ability remains underexplored. In this study, we evaluated 33 VLMs using 253 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions, to evaluate the models' recognition of object properties under varying conditions. We found significant variability in VLM performance, with models' performance on shape constancy clearly dissociated from that on color and size constancy.
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations [15.035663040732798]
We investigate the internal representations of vision-language models (VLMs) to address hallucinations.
We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects.
We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset.
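The projection described here can be sketched as a logit-lens-style readout: multiply an internal image-token activation by the language model's unembedding matrix and inspect the resulting vocabulary distribution. The dimensions, random weights, and variable names below are illustrative stand-ins, not the paper's actual models or data.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
hidden_dim, vocab_size = 64, 1000

unembed = rng.standard_normal((vocab_size, hidden_dim))  # stand-in for the LM head
image_hidden = rng.standard_normal(hidden_dim)           # stand-in internal image-token activation

# Project the hidden state onto the vocabulary and normalize to probabilities.
logits = unembed @ image_hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top tokens under this projection; the paper's reported signal is that
# real objects receive more confident probability mass than hallucinated ones.
top_ids = np.argsort(probs)[::-1][:5]
print(top_ids.tolist())
```

In the paper's setting, `unembed` would be the VLM's actual output embedding matrix and `image_hidden` a hidden state at an image-token position, so the distribution is interpretable as token-level confidence.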
arXiv Detail & Related papers (2024-10-03T17:59:57Z)
- VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- A General Protocol to Probe Large Vision Models for 3D Physical Understanding [84.54972153436466]
We introduce a general protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene.
We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures.
We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties.
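A protocol of this kind typically reads out each physical property with a lightweight classifier trained on frozen features. A minimal sketch with synthetic stand-in features and a ridge-regularized linear probe (the feature source, dimensions, and property label below are all hypothetical, not the paper's actual setup):

```python
import numpy as np

# Synthetic stand-in for frozen vision features (e.g. from a model like
# DINOv2) and a binary physical-property label (e.g. "is supported").
rng = np.random.default_rng(0)
n, d = 200, 32
X = rng.standard_normal((n, d))          # frozen features, one row per image
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)       # linearly recoverable property label

# Linear probe: ridge-regularized least squares on a train split,
# evaluated by thresholded accuracy on a held-out split.
Xtr, ytr, Xte, yte = X[:150], y[:150], X[150:], y[150:]
w = np.linalg.solve(Xtr.T @ Xtr + 1e-2 * np.eye(d), Xtr.T @ ytr)
acc = ((Xte @ w > 0.5) == (yte > 0.5)).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates the property is linearly decodable from the frozen features, which is the sense in which features are "good for discriminative learning" of a property.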
arXiv Detail & Related papers (2023-10-10T17:59:28Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
- ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
The accompanying benchmark contains over 17,000 action trajectories with six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.