ColorFoil: Investigating Color Blindness in Large Vision and Language Models
- URL: http://arxiv.org/abs/2405.11685v2
- Date: Sat, 04 Jan 2025 19:33:49 GMT
- Title: ColorFoil: Investigating Color Blindness in Large Vision and Language Models
- Authors: Ahnaf Mozib Samin, M. Firoz Ahmed, Md. Mushtaq Shahriyar Rafee
- Abstract summary: We introduce a novel V&L benchmark - ColorFoil. We evaluate seven state-of-the-art V&L models including CLIP, ViLT, GroupViT, and BridgeTower.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the utilization of the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate a lack of robustness of the models when dealing with complex linguistic and visual attributes. In this work, we introduce a novel V&L benchmark - ColorFoil - by creating color-related foils to assess the models' ability to perceive basic colors such as red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and present intriguing findings. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception than CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color vision.
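As a rough illustration of the zero-shot foil protocol the abstract describes, the sketch below scores an image against its original caption and a color-swapped foil using an off-the-shelf CLIP checkpoint. The checkpoint name, the COLORS list, and the make_foil helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a color-foil probe for a zero-shot V&L model.
# Assumptions (not from the paper): the CLIP checkpoint, the COLORS list,
# and the make_foil helper are illustrative only.
import random

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

COLORS = ["red", "white", "green", "blue", "black", "yellow"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()


def make_foil(caption):
    """Replace the first color word in the caption with a different color."""
    words = caption.split()
    for i, word in enumerate(words):
        if word.lower() in COLORS:
            words[i] = random.choice([c for c in COLORS if c != word.lower()])
            return " ".join(words)
    return None  # caption contains no color word to foil


def prefers_original(image, caption, foil):
    """True if the model scores the original caption above the color foil."""
    inputs = processor(text=[caption, foil], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image has shape (1, 2): image-text similarity scores
        logits = model(**inputs).logits_per_image.squeeze(0)
    return bool(logits[0] > logits[1])


# Example usage with a local image and its caption:
# image = Image.open("example.jpg")
# caption = "a red car parked on the street"
# foil = make_foil(caption)  # e.g. "a green car parked on the street"
# print(prefers_original(image, caption, foil))
```

Benchmark accuracy under this protocol would simply be the fraction of (image, caption) pairs for which the model prefers the original caption over its color foil.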
Related papers
- ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness [23.857004537384]
It is unclear whether vision-language models (VLMs) can perceive, understand, and leverage color as humans.
This paper introduces ColorBench, a benchmark to assess the capabilities of VLMs in color understanding.
arXiv Detail & Related papers (2025-04-10T16:36:26Z) - Color in Visual-Language Models: CLIP deficiencies [1.0159205678719043]
This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), which is currently the most influential VLM (Visual Language Model) in Artificial Intelligence.
We come across two main deficiencies: (a) a clear bias on achromatic stimuli that are poorly related to the color concept, and (b) the tendency to prioritize text over other visual information.
arXiv Detail & Related papers (2025-02-06T19:38:12Z) - ViTOC: Vision Transformer and Object-aware Captioner [0.0]
ViTOC is a vision-language model for image captioning that addresses the challenges of accuracy and diversity in generated descriptions.
By utilizing pretrained visual model parameters, ViTOC achieves efficient end-to-end training.
arXiv Detail & Related papers (2024-11-09T13:13:49Z) - VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z) - They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias [34.005902280160356]
We propose a novel framework to generate synthetic counterfactual images that can be used to fine-tune CLIP.
We show that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66% for image retrieval tasks.
arXiv Detail & Related papers (2024-06-17T08:42:19Z) - Cartoon Hallucinations Detection: Pose-aware In Context Visual Learning [5.9024599926156744]
We propose a novel visual hallucination detection system for cartoon character images generated by TTI models.
Our approach leverages pose-aware in-context visual learning (PA-ICVL) with Vision-Language Models (VLMs), utilizing both RGB images and pose information.
arXiv Detail & Related papers (2024-03-22T09:13:09Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment to collect a vast amount of real human feedback on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images.
We design a GPT-assisted conversion process that transforms this feedback into 200K instruction-response pairs in diverse formats.
arXiv Detail & Related papers (2023-11-12T09:10:51Z) - ColorSense: A Study on Color Vision in Machine Visual Recognition [57.916512479603064]
We collect 110,000 non-trivial human annotations of foreground and background color labels from visual recognition benchmarks.
We validate the use of our datasets by demonstrating that the level of color discrimination has a dominant effect on the performance of machine perception models.
Our findings suggest that object recognition tasks such as classification and localization are susceptible to color vision bias.
arXiv Detail & Related papers (2022-12-16T18:51:41Z) - UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.