ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
- URL: http://arxiv.org/abs/2504.10514v1
- Date: Thu, 10 Apr 2025 16:36:26 GMT
- Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
- Authors: Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
- Abstract summary: It is unclear whether vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, a benchmark to assess the capabilities of VLMs in color understanding.
- Score: 23.857004537384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Color plays an important role in human perception and often provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans do. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios grounded in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals several previously unreported findings: (i) The scaling law (larger models are better) still holds on ColorBench, and the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) Chain-of-thought (CoT) reasoning improves accuracy and robustness in color understanding, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but these clues can also mislead models in some tasks. These findings highlight critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding in multimodal AI.
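To make the robustness setting concrete, below is a minimal sketch of a color-transformation consistency check in the spirit of ColorBench. It is not the authors' evaluation harness: `query_vlm`, the hue-rotation angles, and the scoring rule are illustrative assumptions.

```python
# Hypothetical sketch: measure how often a VLM's answer to the same question
# survives hue rotations of the input image. Not the ColorBench harness.
import numpy as np
from PIL import Image


def rotate_hue(img: Image.Image, degrees: float) -> Image.Image:
    """Shift the hue channel of an RGB image by the given angle (0-360)."""
    hsv = np.asarray(img.convert("HSV"), dtype=np.uint16).copy()
    hsv[..., 0] = (hsv[..., 0] + round(degrees / 360.0 * 255)) % 256
    return Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")


def color_robustness(query_vlm, img: Image.Image, question: str,
                     angles=(60, 120, 180, 240, 300)) -> float:
    """Fraction of hue-rotated variants whose answer matches the original.

    `query_vlm(image, question) -> str` is a placeholder for any VLM call.
    """
    baseline = query_vlm(img, question)
    hits = sum(query_vlm(rotate_hue(img, a), question) == baseline
               for a in angles)
    return hits / len(angles)
```

A full protocol would also have to exclude questions whose correct answer legitimately changes under recoloring; the sketch above only illustrates the consistency idea.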
Related papers
- Probing Perceptual Constancy in Large Vision Language Models [8.826002715344911]
We evaluated 33 Vision-Language Models (VLMs) using 253 experiments across three domains: color, size, and shape constancy. We found significant variability in VLM performance, with model performance in shape constancy clearly dissociated from that in color and size constancy.
arXiv Detail & Related papers (2025-02-14T16:31:43Z)
- Color in Visual-Language Models: CLIP deficiencies [1.0159205678719043]
This work explores how color is encoded in CLIP (Contrastive Language-Image Pre-training), currently the most influential VLM (visual language model) in artificial intelligence. We identify two main deficiencies: (a) a clear bias toward achromatic stimuli that are poorly related to the color concept, and (b) a tendency to prioritize text over other visual information. (A minimal CLIP color-probing sketch appears after this list.)
arXiv Detail & Related papers (2025-02-06T19:38:12Z)
- MegaCOIN: Enhancing Medium-Grained Color Perception for Vision-Language Models [60.1668189937952]
MegaCOIN is a high-quality, human-labeled dataset based on real images with various contextual attributes.
MegaCOIN consists of two parts: MegaCOIN-Instruct, which serves as a supervised fine-tuning dataset for vision-language models; and MegaCOIN-Bench, an annotated test set that can be used as a stand-alone QA dataset.
arXiv Detail & Related papers (2024-12-05T07:06:17Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- ColorFoil: Investigating Color Blindness in Large Vision and Language Models [0.0]
We introduce a novel V&L benchmark, ColorFoil. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower.
arXiv Detail & Related papers (2024-05-19T22:04:57Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Pre-Training LiDAR-Based 3D Object Detectors Through Colorization [65.03659880456048]
We introduce an innovative pre-training approach, Grounded Point Colorization (GPC), to bridge the gap between data and labels.
GPC teaches the model to colorize LiDAR point clouds, equipping it with valuable semantic cues.
Experimental results on the KITTI and Waymo datasets demonstrate GPC's remarkable effectiveness.
arXiv Detail & Related papers (2023-10-23T06:00:24Z)
- ColorSense: A Study on Color Vision in Machine Visual Recognition [57.916512479603064]
We collect 110,000 non-trivial human annotations of foreground and background color labels from visual recognition benchmarks. We validate the use of our datasets by demonstrating that the level of color discrimination has a dominating effect on the performance of machine perception models. Our findings suggest that object recognition tasks such as classification and localization are susceptible to color vision bias.
arXiv Detail & Related papers (2022-12-16T18:51:41Z)
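As referenced in the CLIP-deficiencies entry above, a simple way to probe how CLIP encodes color is to match solid color patches against color-name prompts. The sketch below uses the standard Hugging Face CLIP API; the checkpoint, RGB values, and prompt template are illustrative choices, not the paper's protocol.

```python
# Hypothetical probe: does CLIP match a solid color patch to its color name?
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

colors = {
    "red": (220, 30, 30), "green": (30, 160, 60), "blue": (40, 70, 210),
    "yellow": (230, 210, 40), "gray": (128, 128, 128), "black": (15, 15, 15),
}
names = list(colors)
patches = [Image.new("RGB", (224, 224), rgb) for rgb in colors.values()]
prompts = [f"a photo of the color {n}" for n in names]

inputs = processor(text=prompts, images=patches,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)

for name, row in zip(names, probs):  # rows index patches, columns index prompts
    pred = names[int(row.argmax())]
    print(f"{name:>6} patch -> '{pred}' (p={float(row.max()):.2f})")
```

Comparing accuracy on chromatic versus achromatic patches in a probe like this is one way to look for the achromatic bias that paper reports; the specific stimuli above are only an assumed starting point.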