Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
- URL: http://arxiv.org/abs/2505.17461v1
- Date: Fri, 23 May 2025 04:43:55 GMT
- Title: Diagnosing Vision Language Models' Perception by Leveraging Human Methods for Color Vision Deficiencies
- Authors: Kazuki Hayashi, Shintaro Ozaki, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
- Abstract summary: We evaluate Vision Language Models' ability to account for individual-level perceptual variation using the Ishihara Test. Our results show that LVLMs can explain Color Vision Deficiencies in natural language, but they cannot simulate how people with CVDs perceive color in image-based tasks.
- Score: 23.761989930955522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Vision Language Models (LVLMs) are increasingly being applied to a wide range of real-world multimodal applications, involving complex visual and linguistic reasoning. As these models become more integrated into practical use, they are expected to handle complex aspects of human interaction. Among these, color perception is a fundamental yet highly variable aspect of visual understanding. It differs across individuals due to biological factors such as Color Vision Deficiencies (CVDs), as well as differences in culture and language. Despite its importance, perceptual diversity has received limited attention. In our study, we evaluate LVLMs' ability to account for individual-level perceptual variation using the Ishihara Test, a widely used method for detecting CVDs. Our results show that LVLMs can explain CVDs in natural language, but they cannot simulate how people with CVDs perceive color in image-based tasks. These findings highlight the need for multimodal systems that can account for color perceptual diversity and support broader discussions on perceptual inclusiveness and fairness in multimodal AI.
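The abstract's two probe types can be made concrete with a minimal sketch, shown below under explicit assumptions: `query_lvlm` is a hypothetical placeholder for any LVLM client, the plate filename is illustrative, and the CVD simulation uses the widely cited Machado et al. (2009) protanopia matrix, applied directly in sRGB as a simplification.

```python
# A minimal sketch of the two probes the abstract contrasts: (1) a text probe
# asking the model to explain a CVD, which the abstract says LVLMs pass, and
# (2) an image probe asking the model to read an Ishihara plate as a
# protanope would, which they fail. `query_lvlm` is a placeholder stub.
import numpy as np
from PIL import Image

def query_lvlm(text: str, image: Image.Image | None = None) -> str:
    """Placeholder for a real LVLM client call (assumption)."""
    raise NotImplementedError

# Machado et al. (2009) protanopia matrix (severity 1.0), applied directly in
# sRGB here as a simplification; a faithful simulation converts to linear RGB.
PROTAN = np.array([
    [ 0.152286,  1.052583, -0.204868],
    [ 0.114503,  0.786281,  0.099216],
    [-0.003882, -0.048116,  1.051998],
])

def simulate_protanopia(img: Image.Image) -> Image.Image:
    rgb = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0
    sim = np.clip(rgb @ PROTAN.T, 0.0, 1.0)
    return Image.fromarray((sim * 255).astype(np.uint8))

plate = Image.open("ishihara_plate.png")  # hypothetical plate image

# Probe 1: natural-language explanation of the deficiency.
explanation = query_lvlm("Explain what a person with protanopia sees.")

# Probe 2: image-based simulation; the answer should match what a protanope
# reads from the plate (compare against `simulate_protanopia(plate)`).
answer = query_lvlm("Answer as a person with protanopia: what number, "
                    "if any, is visible in this plate?", image=plate)
```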
Related papers
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions.
Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
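The core mechanism lends itself to a short sketch: corrupt one visual detail in a caption, then grant an automatically verifiable reward only when the model pinpoints the injected word. The swap list and exact-match reward below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a ViCrit-style proxy task: inject one subtle contradiction into a
# human-written caption and score the model's localization with a binary,
# judge-free reward. Word list and reward shape are illustrative assumptions.
import random

SWAPS = {"red": "green", "dog": "cat", "left": "right", "wooden": "metal"}

def inject_hallucination(caption: str, rng: random.Random):
    """Replace one swappable word; return the corrupted caption and target."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SWAPS]
    if not candidates:
        return caption, None
    i = rng.choice(candidates)
    target = SWAPS[words[i].lower()]
    words[i] = target
    return " ".join(words), target

def reward(model_answer: str, target) -> float:
    """Verifiable reward: 1.0 iff the model names the injected word."""
    return float(target is not None and model_answer.strip().lower() == target)

rng = random.Random(0)
corrupted, target = inject_hallucination(
    "A red dog sleeps on the wooden floor.", rng)
# The VLM is shown the image plus `corrupted` and asked which detail
# contradicts the image; `reward` scores its answer with no human judge.
```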
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- Visual Language Models show widespread visual deficits on neuropsychological tests [0.0]
We use the toolkit of neuropsychology to assess the capabilities of three state-of-the-art Visual Language Models (VLMs).
We find widespread deficits in low- and mid-level visual abilities that would be considered clinically significant in humans.
These selective deficits, profiled through validated test batteries, suggest that an artificial system can achieve complex object recognition without developing foundational visual concepts that in humans require no explicit training.
arXiv Detail & Related papers (2025-04-15T01:04:56Z)
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- With Ears to See and Eyes to Hear: Sound Symbolism Experiments with Multimodal Large Language Models [16.583370726582356]
We show that Vision Language Models (VLMs) can implicitly understand sound-based phenomena via abstract reasoning from orthography and imagery alone.
We perform experiments including replicating the classic Kiki-Bouba and Mil-Mal shape and magnitude symbolism tasks.
Our results show that VLMs demonstrate varying levels of agreement with human labels, and that VLMs may require more task information than their human counterparts for in silico experimentation.
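A minimal sketch of such a forced-choice replication follows, assuming a hypothetical `query_vlm` stub and placeholder shape images; the "kiki goes with the spiky shape" reference is the classic human majority pairing, not the paper's data.

```python
# Illustrative forced-choice loop for a Kiki-Bouba style replication;
# `query_vlm` is a hypothetical stub and the image files are placeholders.
def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a real VLM client call (assumption)."""
    raise NotImplementedError

SPIKY_IMAGES = ["spiky_1.png", "spiky_2.png"]  # hypothetical shape stimuli

agree = 0
for path in SPIKY_IMAGES:
    ans = query_vlm(path, 'Which pseudoword fits this shape better: '
                          '"kiki" or "bouba"? Answer with one word.')
    agree += ans.strip().lower().strip('".') == "kiki"
print(f"agreement with the human majority pairing: {agree}/{len(SPIKY_IMAGES)}")
```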
arXiv Detail & Related papers (2024-09-23T11:13:25Z)
- Beyond the Hype: A dispassionate look at vision-language models in medical scenario [3.4299097748670255]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks.
Their performance and reliability in specialized domains such as medicine remain insufficiently assessed.
We introduce RadVUQA, a novel benchmark to comprehensively evaluate existing LVLMs.
arXiv Detail & Related papers (2024-08-16T12:32:44Z)
- Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models [7.511284868070148]
We investigate whether integration of visuo-linguistic information leads to representations that are more aligned with human brain activity.
Our findings indicate an advantage of multimodal models in predicting human brain activations.
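The standard way to test such alignment is an encoding model: regress brain responses on model embeddings and compare held-out fit across feature sets. The sketch below uses random arrays as stand-ins for real fMRI and embedding data.

```python
# Encoding-model sketch: fit ridge regressions from unimodal vs. multimodal
# embeddings to per-voxel brain responses and compare held-out correlation.
# All arrays here are random placeholders for real data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_concepts, n_voxels = 200, 500
brain = rng.standard_normal((n_concepts, n_voxels))    # fMRI responses
feats_text = rng.standard_normal((n_concepts, 768))    # unimodal embeddings
feats_multi = rng.standard_normal((n_concepts, 768))   # multimodal embeddings

def encoding_score(feats, brain):
    """Mean per-voxel correlation between predicted and held-out responses."""
    Xtr, Xte, ytr, yte = train_test_split(feats, brain, random_state=0)
    pred = Ridge(alpha=10.0).fit(Xtr, ytr).predict(Xte)
    r = [np.corrcoef(pred[:, v], yte[:, v])[0, 1] for v in range(yte.shape[1])]
    return float(np.mean(r))

print("text-only :", encoding_score(feats_text, brain))
print("multimodal:", encoding_score(feats_multi, brain))
```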
arXiv Detail & Related papers (2024-07-25T10:08:37Z)
- Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs": images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between the visual patterns that challenge CLIP models and those that are problematic for multimodal LLMs.
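A minimal sketch of mining such pairs follows, using DINO as a vision-only reference model (an assumption; the similarity thresholds are illustrative) and standard Hugging Face checkpoints.

```python
# Sketch of finding "CLIP-blind pairs": image pairs with high CLIP embedding
# similarity but low similarity under a vision-only reference model (DINO).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor, ViTModel, ViTImageProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino = ViTModel.from_pretrained("facebook/dino-vitb16")
dino_proc = ViTImageProcessor.from_pretrained("facebook/dino-vitb16")

@torch.no_grad()
def embed(img: Image.Image):
    """Return the CLIP image embedding and the DINO CLS-token embedding."""
    c = clip.get_image_features(**clip_proc(images=img, return_tensors="pt"))[0]
    d = dino(**dino_proc(images=img, return_tensors="pt")).last_hidden_state[0, 0]
    return c, d

def is_clip_blind(img_a, img_b, hi=0.95, lo=0.6):
    """CLIP sees the pair as near-identical while DINO does not
    (thresholds are illustrative assumptions)."""
    ca, da = embed(img_a)
    cb, db = embed(img_b)
    cos = torch.nn.functional.cosine_similarity
    return bool(cos(ca, cb, dim=0) > hi and cos(da, db, dim=0) < lo)
```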
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in human-object interaction (HOI) detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
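A rough PyTorch sketch of what such a multi-tower cue-integration module could look like is given below: instance and interaction queries each cross-attend to LVLM-produced cue embeddings. Dimensions, depth, and naming are assumptions rather than the paper's architecture.

```python
# Sketch of a multi-tower fusion module: separate towers let instance and
# interaction queries each attend over contextual-cue embeddings.
import torch
import torch.nn as nn

class CueTower(nn.Module):
    """One tower: detector queries cross-attend to contextual-cue embeddings."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, cues):
        fused, _ = self.attn(queries, cues, cues)
        return self.norm(queries + fused)

class MultiTowerFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.instance_tower = CueTower(dim)
        self.interaction_tower = CueTower(dim)

    def forward(self, inst_q, inter_q, cues):
        return (self.instance_tower(inst_q, cues),
                self.interaction_tower(inter_q, cues))

fusion = MultiTowerFusion()
inst = torch.randn(2, 100, 256)   # instance queries
inter = torch.randn(2, 64, 256)   # interaction queries
cues = torch.randn(2, 16, 256)    # cue embeddings from an LVLM
out_inst, out_inter = fusion(inst, inter, cues)
```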
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero and few-shot settings.
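The coordination paradigm reduces to a simple loop: each VLM answers independently and an LLM reconciles the candidates. A minimal sketch follows, with `ask_vlm` and `ask_llm` as hypothetical stubs for real model clients.

```python
# Sketch of LLM-coordinated visual reasoning: collect independent VLM answers,
# then let a text-only LLM pick the best final answer.
def ask_vlm(name: str, image_path: str, question: str) -> str:
    """Placeholder for a real VLM client call (assumption)."""
    raise NotImplementedError

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call (assumption)."""
    raise NotImplementedError

def coordinate(image_path: str, question: str, vlms=("vlm_a", "vlm_b")) -> str:
    candidates = {v: ask_vlm(v, image_path, question) for v in vlms}
    report = "\n".join(f"{v} answered: {a}" for v, a in candidates.items())
    # The coordinator sees only text: the question plus each VLM's answer.
    return ask_llm(
        f"Question about an image: {question}\n{report}\n"
        "Considering which answers agree and which are plausible, "
        "give the single best final answer.")
```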
arXiv Detail & Related papers (2023-10-23T17:59:31Z) - Large language models predict human sensory judgments across six
modalities [12.914521751805658]
We show that state-of-the-art large language models can unlock new insights into the problem of recovering the perceptual world from language.
We elicit pairwise similarity judgments from GPT models across six psychophysical datasets.
We show that the judgments are significantly correlated with human data across all domains, recovering well-known representations like the color wheel and pitch spiral.
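A sketch of that elicitation-and-correlation loop is shown below; both helper functions are hypothetical placeholders and the color list is illustrative.

```python
# Sketch: prompt an LLM for pairwise similarity ratings, then correlate them
# with human psychophysical judgments via Spearman's rank correlation.
from itertools import combinations
from scipy.stats import spearmanr

def rate_similarity(a: str, b: str) -> float:
    """Placeholder: prompt an LLM for a 0-1 similarity rating (assumption)."""
    raise NotImplementedError

def load_human_ratings(pairs):
    """Placeholder for dataset-specific human similarity data (assumption)."""
    raise NotImplementedError

colors = ["red", "orange", "yellow", "green", "blue", "purple"]
pairs = list(combinations(colors, 2))
model_judgments = [rate_similarity(a, b) for a, b in pairs]
human_judgments = load_human_ratings(pairs)
rho, p = spearmanr(model_judgments, human_judgments)
print(f"Spearman rho = {rho:.2f} (p = {p:.3g})")
```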
arXiv Detail & Related papers (2023-02-02T18:32:46Z)
- ColorSense: A Study on Color Vision in Machine Visual Recognition [57.916512479603064]
We collect 110,000 non-trivial human annotations of foreground and background color labels from visual recognition benchmarks.
We validate the use of our datasets by demonstrating that the level of color discrimination has a dominant effect on the performance of machine perception models.
Our findings suggest that object recognition tasks such as classification and localization are susceptible to color vision bias.
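One simple way to probe such color reliance is to compare a recognizer's accuracy on original, grayscale, and hue-shifted inputs. The sketch below uses an off-the-shelf ResNet-18 purely as an illustration, not the paper's protocol.

```python
# Color-ablation probe: a large accuracy gap between "original" and the
# color-ablated variants indicates color-dependent recognition.
import torch
from torchvision import transforms
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
model = resnet18(weights=weights).eval()
base = weights.transforms()                      # standard eval preprocessing

VARIANTS = {
    "original":  base,
    "grayscale": transforms.Compose([transforms.Grayscale(3), base]),
    "hue_shift": transforms.Compose([transforms.ColorJitter(hue=0.5), base]),
}

@torch.no_grad()
def accuracy(loader, tf):
    """`loader` yields (PIL image, int label) pairs."""
    correct = total = 0
    for img, label in loader:
        pred = model(tf(img).unsqueeze(0)).argmax(1).item()
        correct += pred == label
        total += 1
    return correct / total
```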
arXiv Detail & Related papers (2022-12-16T18:51:41Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
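A rough PyTorch sketch of what "separated attention spaces" could look like is given below: one attention module per modality, with the two views merged afterwards. Sizes and the merge scheme are assumptions, not the paper's exact design.

```python
# Sketch of disentangled multimodal attention: vision and language tokens get
# distinct attention modules instead of one shared space.
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, tokens, lang, vis):
        # Each modality is attended to in its own space...
        from_lang, _ = self.lang_attn(tokens, lang, lang)
        from_vis, _ = self.vis_attn(tokens, vis, vis)
        # ...and the two views are combined afterwards.
        return self.merge(torch.cat([from_lang, from_vis], dim=-1))

layer = DisentangledAttention()
tokens = torch.randn(2, 20, 768)  # query tokens
lang = torch.randn(2, 16, 768)    # language tokens
vis = torch.randn(2, 36, 768)     # visual region features
out = layer(tokens, lang, vis)    # shape (2, 20, 768)
```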
arXiv Detail & Related papers (2022-10-28T23:00:40Z)