Probing Perceptual Constancy in Large Vision Language Models
- URL: http://arxiv.org/abs/2502.10273v1
- Date: Fri, 14 Feb 2025 16:31:43 GMT
- Title: Probing Perceptual Constancy in Large Vision Language Models
- Authors: Haoran Sun, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
- Abstract summary: We evaluated 33 Vision-Language Models (VLMs) using 253 experiments across three domains: color, size, and shape constancy.
We found significant variability in VLM performance, with models' performance on shape constancy clearly dissociated from that on color and size constancy.
- Abstract: Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for recognizing visual information in a dynamic world, making it essential for Vision-Language Models (VLMs). However, whether VLMs are currently and theoretically capable of mastering this ability remains underexplored. In this study, we evaluated 33 VLMs using 253 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions, to evaluate the models' recognition of object properties under varying conditions. We found significant variability in VLM performance, with models' performance on shape constancy clearly dissociated from that on color and size constancy.
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations [15.035663040732798]
We investigate the internal representations of vision-language models (VLMs) to address hallucinations.
We project VLMs' internal image representations to their language vocabulary and observe more confident output probabilities on real objects than hallucinated objects.
We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset.
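The projection described here can be sketched as a logit-lens-style readout: multiply an internal image-token activation by the language model's unembedding matrix and inspect the resulting vocabulary distribution. The dimensions, random weights, and variable names below are illustrative stand-ins, not the paper's actual models or data.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
rng = np.random.default_rng(0)
hidden_dim, vocab_size = 64, 1000

unembed = rng.standard_normal((vocab_size, hidden_dim))  # stand-in for the LM head
image_hidden = rng.standard_normal(hidden_dim)           # stand-in internal image-token activation

# Project the hidden state onto the vocabulary and normalize to probabilities.
logits = unembed @ image_hidden
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top tokens under this projection; the paper's reported signal is that
# real objects receive more confident probability mass than hallucinated ones.
top_ids = np.argsort(probs)[::-1][:5]
print(top_ids.tolist())
```

In the paper's setting, `unembed` would be the VLM's actual output embedding matrix and `image_hidden` a hidden state at an image-token position, so the distribution is interpretable as token-level confidence.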
arXiv Detail & Related papers (2024-10-03T17:59:57Z)
- VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs [3.2688425993442696]
The ability to understand and reason about spatial relationships between objects in images is an important component of visual reasoning.
We extend the previously released What'sUp dataset and propose a novel comprehensive evaluation for spatial relationship understanding.
arXiv Detail & Related papers (2024-06-19T06:15:26Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- A General Protocol to Probe Large Vision Models for 3D Physical Understanding [84.54972153436466]
We introduce a general protocol to evaluate whether features of an off-the-shelf large vision model encode a number of physical 'properties' of the 3D scene.
We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view-dependent measures.
We find that features from Stable Diffusion and DINOv2 are good for discriminative learning of a number of properties.
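A protocol of this kind typically reads out each physical property with a lightweight classifier trained on frozen features. A minimal sketch with synthetic stand-in features and a ridge-regularized linear probe (the feature source, dimensions, and property label below are all hypothetical, not the paper's actual setup):

```python
import numpy as np

# Synthetic stand-in for frozen vision features (e.g. from a model like
# DINOv2) and a binary physical-property label (e.g. "is supported").
rng = np.random.default_rng(0)
n, d = 200, 32
X = rng.standard_normal((n, d))          # frozen features, one row per image
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)       # linearly recoverable property label

# Linear probe: ridge-regularized least squares on a train split,
# evaluated by thresholded accuracy on a held-out split.
Xtr, ytr, Xte, yte = X[:150], y[:150], X[150:], y[150:]
w = np.linalg.solve(Xtr.T @ Xtr + 1e-2 * np.eye(d), Xtr.T @ ytr)
acc = ((Xte @ w > 0.5) == (yte > 0.5)).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy indicates the property is linearly decodable from the frozen features, which is the sense in which features are "good for discriminative learning" of a property.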
arXiv Detail & Related papers (2023-10-10T17:59:28Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
- ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
The accompanying benchmark contains over 17,000 action trajectories with six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.