Vision-Language Models under Cultural and Inclusive Considerations
- URL: http://arxiv.org/abs/2407.06177v1
- Date: Mon, 8 Jul 2024 17:50:00 GMT
- Title: Vision-Language Models under Cultural and Inclusive Considerations
- Authors: Antonia Karamolegkou, Phillip Rust, Yong Cao, Ruixiang Cui, Anders Søgaard, Daniel Hershcovich
- Abstract summary: Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives.
Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case.
We create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind.
We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting.
- Score: 53.614528867159706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) can assist visually impaired people by describing images from their daily lives. Current evaluation datasets may not reflect diverse cultural user backgrounds or the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate several VLMs, investigating their reliability as visual assistants in a culturally diverse setting. While our results for state-of-the-art models are promising, we identify challenges such as hallucination and misalignment of automatic evaluation metrics with human judgment. We make our survey, data, code, and model outputs publicly available.
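As a rough illustration of the metric-alignment issue mentioned in the abstract, the sketch below correlates automatic caption scores with human ratings. The score values, rating scale, and metric choice are hypothetical placeholders, not data from the paper.

```python
# Sketch: checking how well an automatic captioning metric tracks human judgments.
# The score lists below are hypothetical placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

# One entry per generated caption: an automatic metric score (e.g., CIDEr or
# CLIPScore) and a human preference rating collected in a survey.
automatic_scores = [0.42, 0.88, 0.15, 0.67, 0.73, 0.31]
human_ratings = [3, 4, 2, 5, 3, 1]  # e.g., 1-5 Likert ratings

pearson_r, _ = pearsonr(automatic_scores, human_ratings)
spearman_rho, _ = spearmanr(automatic_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
# Low correlations would indicate the kind of misalignment between automatic
# metrics and human judgment that the abstract reports.
```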
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM)
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z) - Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
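A minimal sketch of the preference-pair construction summarized above, not the authors' implementation: the prompts, the `describe` interface, and the stub model are assumptions for illustration.

```python
# Sketch of DPO-style preference-pair construction for image descriptions,
# loosely following the self-training idea summarized above. The prompts,
# the misleading-prompt trick, and the `describe` interface are assumptions.
from typing import Callable, Dict

def build_preference_pair(image, describe: Callable[[object, str], str]) -> Dict[str, str]:
    """Create a (chosen, rejected) description pair from one unlabeled image.

    `describe(image, prompt)` is any function returning a model-generated
    description; it is treated as a black box here.
    """
    good_prompt = "Describe the image in detail, step by step."
    misleading_prompt = "Describe the image, mentioning objects even if they are not present."
    return {
        "chosen": describe(image, good_prompt),        # preferred response
        "rejected": describe(image, misleading_prompt),  # dis-preferred response
    }

# Example with a stub model so the sketch runs without any VLM installed:
if __name__ == "__main__":
    stub = lambda img, prompt: f"[caption generated for prompt: {prompt!r}]"
    print(build_preference_pair("photo.jpg", stub))
```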
arXiv Detail & Related papers (2024-05-30T05:53:49Z) - Towards Geographic Inclusion in the Evaluation of Text-to-Image Models [25.780536950323683]
We study how much annotators in Africa, Europe, and Southeast Asia vary in their perception of geographic representation, visual appeal, and consistency in real and generated images.
For example, annotators in different locations often disagree on whether exaggerated, stereotypical depictions of a region are considered geographically representative.
We recommend steps for improved automatic and human evaluations.
arXiv Detail & Related papers (2024-05-07T16:23:06Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
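As a toy illustration of object-level faithfulness checking (not the VALOR-EVAL metric itself), the sketch below scores a caption by how many of its mentioned objects are supported by the ground-truth annotation; the object lists are made up.

```python
# Illustrative object-level faithfulness check: how many objects mentioned in a
# generated caption actually appear in the ground-truth annotation?
def faithfulness(mentioned: set, ground_truth: set) -> float:
    """Fraction of mentioned objects supported by the annotation."""
    if not mentioned:
        return 1.0
    return len(mentioned & ground_truth) / len(mentioned)

caption_objects = {"dog", "frisbee", "child"}    # parsed from the model output
annotated_objects = {"dog", "frisbee", "grass"}  # ground-truth objects in image

score = faithfulness(caption_objects, annotated_objects)
print(f"faithfulness = {score:.2f}")  # "child" is unsupported -> 2/3
```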
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Evaluating Image Review Ability of Vision Language Models [25.846728716526766]
This paper explores the use of large-scale vision-language models (LVLMs) to generate review texts for images.
How well LVLMs can review images is not yet well understood, which motivates a methodical evaluation of this ability.
arXiv Detail & Related papers (2024-02-19T13:16:10Z) - FACET: Fairness in Computer Vision Evaluation Benchmark [21.862644380063756]
Computer vision models have known performance disparities across attributes such as gender and skin tone.
We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion)
FACET is a large, publicly available evaluation set of 32k images for some of the most common vision tasks.
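The sketch below shows the kind of per-attribute disparity measurement an evaluation set like FACET enables; the grouping, task, and records are hypothetical, and this is not the benchmark's own tooling.

```python
# Sketch of a per-group accuracy comparison; groups and predictions are made up.
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, correct) pairs -> per-group accuracy."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, correct in records:
        totals[group] += 1
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}

records = [("group_a", True), ("group_a", True), ("group_a", False),
           ("group_b", True), ("group_b", False), ("group_b", False)]
per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"disparity gap = {gap:.2f}")
```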
arXiv Detail & Related papers (2023-08-31T17:59:48Z) - DeAR: Debiasing Vision-Language Models with Additive Residuals [5.672132510411465]
Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations.
These models suffer from societal biases owing to the skewed distribution of various identity groups in the training data.
We present DeAR, a novel debiasing method that learns additive residual image representations to offset the original representations.
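A minimal sketch of the additive-residual idea, assuming a small learned module added on top of frozen image features; layer sizes and training details are illustrative, not the paper's exact setup.

```python
# Minimal sketch: a learned module produces a residual that is added to the
# frozen image representation. Sizes and architecture are assumptions.
import torch
import torch.nn as nn

class AdditiveResidualDebiaser(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, image_repr: torch.Tensor) -> torch.Tensor:
        # Debiased representation = original representation + learned offset.
        return image_repr + self.residual(image_repr)

z = torch.randn(4, 512)               # stand-in for frozen VLM image features
debiased = AdditiveResidualDebiaser()(z)
print(debiased.shape)                 # torch.Size([4, 512])
```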
arXiv Detail & Related papers (2023-03-18T14:57:43Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Ambiguous Images With Human Judgments for Robust Visual Event Classification [34.62731821199598]
We create datasets of ambiguous images and use them to produce SQUID-E ("Squidy"), a collection of noisy images extracted from videos.
All images are annotated with ground truth values and a test set is annotated with human uncertainty judgments.
We use this dataset to characterize human uncertainty in vision tasks and evaluate existing visual event classification models.
arXiv Detail & Related papers (2022-10-06T17:52:20Z) - Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We explore Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
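In the spirit of the antonym-prompt approach described above, the sketch below uses the public Hugging Face CLIP API to produce a zero-shot quality ("look") score; the checkpoint, prompt pair, and image path are illustrative choices rather than the paper's exact setup.

```python
# Zero-shot quality scoring with CLIP via an antonym prompt pair.
# Checkpoint, prompts, and image path are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")           # any local image file
prompts = ["Good photo.", "Bad photo."]   # antonym prompt pair for quality

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
logits = model(**inputs).logits_per_image      # shape: (1, 2)
probs = logits.softmax(dim=-1)
quality_score = probs[0, 0].item()             # P("Good photo.") as a quality proxy
print(f"zero-shot quality score = {quality_score:.3f}")
```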