A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
- URL: http://arxiv.org/abs/2402.13636v2
- Date: Mon, 17 Jun 2024 08:35:35 GMT
- Title: A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
- Authors: Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
- Abstract summary: We propose a unified framework for evaluating gender, race, and age biases in vision-language models (VLMs).
We generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains.
The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in VLMs.
- Score: 9.025958469582363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in VLMs. In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. We also find that VLMs exhibit distinct biases across the different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.
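To make the image-to-text part of this setup concrete, the sketch below probes a captioning-capable VLM with attribute-concealing profession images and scores how gendered its captions are. It is a minimal sketch, not the authors' released pipeline: the caption_fn wrapper, the profession-to-image mapping, and the gendered word lists are illustrative assumptions, and race or age probes would follow the same pattern with different term lists.

```python
# Minimal sketch: score gendered language in a VLM's captions per profession.
from collections import Counter
from typing import Callable, Dict, List

# Illustrative word lists (assumption); real probes would use richer lexicons.
MASCULINE = {"he", "him", "his", "man", "male", "boy"}
FEMININE = {"she", "her", "hers", "woman", "female", "girl"}


def gender_bias_score(captions: List[str]) -> float:
    """+1 = only masculine terms, -1 = only feminine terms, 0 = balanced or neutral."""
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        counts["m"] += sum(t in MASCULINE for t in tokens)
        counts["f"] += sum(t in FEMININE for t in tokens)
    total = counts["m"] + counts["f"]
    return 0.0 if total == 0 else (counts["m"] - counts["f"]) / total


def probe_image_to_text(
    caption_fn: Callable[[str], str],            # wrapper around any VLM captioning call (assumption)
    images_by_profession: Dict[str, List[str]],  # profession -> attribute-concealing image paths
    samples_per_image: int = 5,                  # captions sampled per image
) -> Dict[str, float]:
    """Return a per-profession gender score for the image-to-text mode."""
    return {
        profession: gender_bias_score(
            [caption_fn(path) for path in paths for _ in range(samples_per_image)]
        )
        for profession, paths in images_by_profession.items()
    }
```

Applying an analogous score to the other modes (e.g., classifying the apparent gender in text-to-image generations) is what allows bias magnitude and direction to be compared across modalities, as the abstract describes.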
Related papers
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models [3.018378575149671]
We show that vision-language models (VLMs) are biased towards identifying the individual with the expected gender as the performer of the activity.
We refer to this bias in associating an activity with the gender of its actual performer in an image or text as the Gender-Activity Binding (GAB) bias.
Our experiments indicate that VLMs experience an average performance decline of about 13.2% when confronted with gender-activity binding bias.
arXiv Detail & Related papers (2024-07-30T17:46:06Z)
- GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing [72.0343083866144]
This paper introduces the GenderBias-VL benchmark to evaluate occupation-related gender bias in Large Vision-Language Models.
Using our benchmark, we extensively evaluate 15 commonly used open-source LVLMs and state-of-the-art commercial APIs.
Our findings reveal widespread gender biases in existing LVLMs.
arXiv Detail & Related papers (2024-06-30T05:55:15Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews Vision-Language Geo-Foundation Models (VLGFMs), summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- Uncovering Bias in Large Vision-Language Models at Scale with Counterfactuals [8.41410889524315]
We study the social biases contained in text generated by Large Vision-Language Models (LVLMs).
We present LVLMs with identical open-ended text prompts while conditioning on images from different counterfactual sets.
We evaluate the text produced by different models under this counterfactual generation setting at scale, producing over 57 million responses from popular LVLMs.
arXiv Detail & Related papers (2024-05-30T15:27:56Z)
- Bridge the Modality and Capability Gaps in Vision-Language Model Selection [62.26769826687365]
Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names.
To better reuse VLM resources, a promising strategy is selecting appropriate pre-trained VLMs from the VLM Zoo.
We analyze two inherent challenges in this language-only VLM selection: the modality gap and the capability gap.
We propose VLM Selection With gAp Bridging to mitigate the negative impact of these two gaps.
arXiv Detail & Related papers (2024-03-20T17:54:58Z)
- Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions [11.786387517781328]
Vision-Language Models (VLMs) are advanced models that can tackle more intricate tasks such as image captioning and visual question answering.
Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs, and models that both accept and produce multimodal inputs and outputs.
We meticulously dissect each model, offering an extensive analysis of its foundational architecture and training data sources, as well as its strengths and limitations wherever possible.
arXiv Detail & Related papers (2024-02-20T18:57:34Z)
- Vision-Language Models for Vision Tasks: A Survey [62.543250338410836]
Vision-Language Models (VLMs) learn rich vision-language correlation from web-scale image-text pairs.
This paper provides a systematic review of visual language models for various visual recognition tasks.
arXiv Detail & Related papers (2023-04-03T02:17:05Z)
- DeAR: Debiasing Vision-Language Models with Additive Residuals [5.672132510411465]
Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations.
These models suffer from societal biases owing to the skewed distribution of various identity groups in the training data.
We present DeAR, a novel debiasing method that learns additive residual image representations to offset the original representations (see the sketch after this list).
arXiv Detail & Related papers (2023-03-18T14:57:43Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
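As referenced in the DeAR entry above, the core idea there is to leave a pretrained image encoder frozen and learn only an additive residual on its embeddings that suppresses protected-attribute information. The sketch below is a minimal, generic rendering of that idea under assumed choices (frozen 512-dimensional embeddings, a small residual MLP, an adversarial protected-attribute classifier, and a reconstruction penalty); it is not the paper's actual implementation.

```python
# Minimal sketch of additive-residual debiasing on frozen image embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualDebiaser(nn.Module):
    """Outputs z + r(z): the frozen encoder's embedding plus a learned residual."""

    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.residual(z)


def train_step(debiaser, attr_clf, opt_d, opt_c, z, attr_labels, lam=1.0):
    """One adversarial step: the classifier tries to predict the protected
    attribute from debiased embeddings; the debiaser tries to defeat it while
    staying close to the original embeddings so downstream semantics survive."""
    # 1) Update the protected-attribute classifier on debiased (detached) embeddings.
    opt_c.zero_grad()
    clf_loss = F.cross_entropy(attr_clf(debiaser(z).detach()), attr_labels)
    clf_loss.backward()
    opt_c.step()

    # 2) Update the debiaser: raise the classifier's loss (a crude adversarial
    #    signal; gradient reversal is a common alternative) plus a reconstruction
    #    penalty that keeps z + r(z) near z.
    opt_d.zero_grad()
    z_hat = debiaser(z)
    adv_loss = -F.cross_entropy(attr_clf(z_hat), attr_labels)
    recon_loss = F.mse_loss(z_hat, z)
    (adv_loss + lam * recon_loss).backward()
    opt_d.step()
    return clf_loss.item(), recon_loss.item()


# Illustrative usage with random tensors standing in for CLIP-style features.
debiaser = ResidualDebiaser()
attr_clf = nn.Linear(512, 2)  # binary protected attribute, e.g. perceived gender
opt_d = torch.optim.Adam(debiaser.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(attr_clf.parameters(), lr=1e-4)
z = torch.randn(32, 512)
labels = torch.randint(0, 2, (32,))
train_step(debiaser, attr_clf, opt_d, opt_c, z, labels)
```

Keeping the encoder frozen and learning only the residual is what makes this family of methods cheap to apply on top of existing VLM representations.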