Assessing Color Vision Test in Large Vision-language Models
- URL: http://arxiv.org/abs/2507.11153v1
- Date: Tue, 15 Jul 2025 10:03:06 GMT
- Title: Assessing Color Vision Test in Large Vision-language Models
- Authors: Hongfei Ye, Bin Chen, Wenxi Liu, Yu Zhang, Zhao Li, Dandan Ni, Hongyang Chen
- Abstract summary: We define a color vision testing task for large vision-language models and construct a dataset. We analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.
- Score: 27.393293081308222
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the widespread adoption of large vision-language models, the capacity for color vision in these models is crucial. However, the color vision abilities of large vision-language models have not yet been thoroughly explored. To address this gap, we define a color vision testing task for large vision-language models and construct a dataset (an anonymous GitHub repository showing some of the data: https://anonymous.4open.science/r/color-vision-test-dataset-3BCD) that covers multiple categories of test questions and tasks of varying difficulty levels. Furthermore, we analyze the types of errors made by large vision-language models and propose fine-tuning strategies to enhance their performance in color vision tests.
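To make the task concrete, here is a minimal sketch of what a color vision evaluation loop for a vision-language model could look like. The `query_vlm` function and the `PlateItem` fields are hypothetical placeholders introduced for this summary, not the authors' harness; the actual dataset and protocol are described in the paper.

```python
# Minimal sketch of a color-vision evaluation loop for a vision-language model.
# `query_vlm` and the PlateItem fields are hypothetical placeholders; they are
# not the authors' harness and must be wired to a real VLM to run end to end.
from dataclasses import dataclass
from typing import List


@dataclass
class PlateItem:
    image_path: str   # e.g. an Ishihara-style test plate
    question: str     # e.g. "What number is hidden in this plate?"
    answer: str       # ground-truth reading, e.g. "74"


def query_vlm(image_path: str, question: str) -> str:
    """Stand-in for a real vision-language model call (API or local model)."""
    raise NotImplementedError("Replace with an actual VLM inference call.")


def evaluate(items: List[PlateItem]) -> float:
    """Exact-match accuracy over the test plates."""
    correct = 0
    for item in items:
        prediction = query_vlm(item.image_path, item.question)
        # A fuller harness would also categorise error types
        # (misread digit, hallucinated pattern, refusal, ...).
        if prediction.strip().lower() == item.answer.strip().lower():
            correct += 1
    return correct / len(items)
```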
Related papers
- Autoregressive Models in Vision: A Survey [119.23742136065307]
This survey comprehensively examines the literature on autoregressive models applied to vision. We divide visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models. We present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation.
arXiv Detail & Related papers (2024-11-08T17:15:12Z)
- On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning [33.89483627891117]
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency.
Open-source models handle general image tasks effectively, but face challenges with the high computational demands of complex visually-situated text understanding.
This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs.
arXiv Detail & Related papers (2024-06-17T17:57:30Z)
- ColorFoil: Investigating Color Blindness in Large Vision and Language Models [0.0]
We introduce a novel V&L benchmark, ColorFoil. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower.
arXiv Detail & Related papers (2024-05-19T22:04:57Z)
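As an illustration of how a colour-focused probe of a contrastive V&L model such as CLIP can be set up, the sketch below scores an image against a caption and a colour-swapped foil. This is a generic foil-style probe written for this summary, not the ColorFoil protocol itself; the model name, image path, and colour pair are assumptions.

```python
# Generic colour-foil probe for CLIP (illustrative only; not the ColorFoil
# benchmark). Model name, image path, and captions are assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("red_apple.jpg")          # placeholder test image
captions = [
    "a photo of a red apple",                # correct colour
    "a photo of a green apple",              # colour foil
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
logits_per_image = model(**inputs).logits_per_image   # shape: (1, 2)
probs = logits_per_image.softmax(dim=-1)

# If the model is sensitive to colour, the correct caption should score higher.
print({c: round(p.item(), 3) for c, p in zip(captions, probs[0])})
```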
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning [65.60607895153692]
MiniGPT-v2 is a model that can be treated as a unified interface for better handling various vision-language tasks.
We propose using unique identifiers for different tasks when training the model.
Our results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks.
arXiv Detail & Related papers (2023-10-14T03:22:07Z)
- Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms.
Multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models.
We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z)
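One way to read "separate verbalisation models" is a pipeline that first turns the image into text (e.g., a caption) and then hands that text to a language-only model. The sketch below is a generic version of that pattern using off-the-shelf Hugging Face pipelines; the model names, image path, and prompt are chosen for illustration rather than taken from the paper.

```python
# Generic "verbalise the image, then reason in text" pipeline (illustrative;
# model names, image path, and prompt are assumptions, not the paper's setup).
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
llm = pipeline("text-generation", model="gpt2")

image_path = "example.jpg"                       # placeholder image
caption = captioner(image_path)[0]["generated_text"]

# Feed the verbalised image to the language-only model as plain text.
prompt = f"Image description: {caption}\nQuestion: What colors are visible?\nAnswer:"
answer = llm(prompt, max_new_tokens=30)[0]["generated_text"]
print(answer)
```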
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes [91.24112204588353]
We introduce UViM, a unified approach capable of modeling a wide range of computer vision tasks.
In contrast to previous models, UViM has the same functional form for all tasks.
We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks.
arXiv Detail & Related papers (2022-05-20T17:47:59Z)