Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based
Disparities
- URL: http://arxiv.org/abs/2301.11100v1
- Date: Thu, 26 Jan 2023 13:44:31 GMT
- Title: Vision-Language Models Performing Zero-Shot Tasks Exhibit Gender-based
Disparities
- Authors: Melissa Hall, Laura Gustafson, Aaron Adcock, Ishan Misra, Candace Ross
- Abstract summary: We explore the extent to which zero-shot vision-language models exhibit gender bias for different vision tasks.
We evaluate different vision-language models with multiple datasets across a set of concepts.
- Score: 19.03751960721954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the extent to which zero-shot vision-language models exhibit
gender bias for different vision tasks. Vision models traditionally required
task-specific labels for representing concepts, as well as finetuning;
zero-shot models like CLIP instead perform tasks with an open-vocabulary,
meaning they do not need a fixed set of labels, by using text embeddings to
represent concepts. With these capabilities in mind, we ask: Do vision-language
models exhibit gender bias when performing zero-shot image classification,
object detection and semantic segmentation? We evaluate different
vision-language models with multiple datasets across a set of concepts and find
(i) all models evaluated show distinct performance differences based on the
perceived gender of the person co-occurring with a given concept in the image
and that aggregating analyses over all concepts can mask these concerns; (ii)
model calibration (i.e. the relationship between accuracy and confidence) also
differs distinctly by perceived gender, even when evaluating on similar
representations of concepts; and (iii) these observed disparities align with
existing gender biases in word embeddings from language models. These findings
suggest that, while language greatly expands the capability of vision tasks, it
can also contribute to social biases in zero-shot vision settings. Furthermore,
biases can further propagate when foundational models like CLIP are used by
other models to enable zero-shot capabilities.
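For context on the open-vocabulary mechanism described above, the following is a minimal sketch of CLIP-style zero-shot image classification, in which concepts are represented by text embeddings rather than a fixed label set. It assumes the Hugging Face `transformers` CLIP API and the `openai/clip-vit-base-patch32` checkpoint; the prompts, image path, and checkpoint are illustrative and are not taken from the paper.

```python
# Minimal sketch of CLIP-style zero-shot classification: concepts are
# represented by text prompts (an open vocabulary) rather than a fixed
# label set, and the image is scored against each prompt's text embedding.
# Checkpoint, prompts, and image path are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo of a laptop", "a photo of a guitar", "a photo of a dog"]
image = Image.open("example.jpg")  # placeholder image

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into per-prompt probabilities, and the highest-scoring prompt is the
# zero-shot prediction together with its confidence.
probs = outputs.logits_per_image.softmax(dim=-1)[0]
print({p: float(s) for p, s in zip(prompts, probs)})
```

Finding (ii) concerns calibration, i.e. the gap between a model's confidence and its accuracy. The sketch below computes a standard expected calibration error per annotated group as one way such a comparison could be made; the group labels and data arrays are hypothetical placeholders, not the paper's evaluation pipeline.

```python
# Hypothetical sketch of a per-group calibration comparison: expected
# calibration error measures the average gap between confidence and
# accuracy across confidence bins. Inputs are placeholders.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Average |accuracy - mean confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Comparing ECE across perceived-gender annotations surfaces the kind of
# calibration disparity the abstract describes, e.g.:
#   for group, idx in group_indices.items():
#       print(group, expected_calibration_error(conf[idx], correct[idx]))
```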
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution [80.57383975987676]
VisoGender is a novel dataset for benchmarking gender bias in vision-language models.
We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas.
We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes.
arXiv Detail & Related papers (2023-06-21T17:59:51Z)
- DeAR: Debiasing Vision-Language Models with Additive Residuals [5.672132510411465]
Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations.
These models suffer from societal biases owing to the skewed distribution of various identity groups in the training data.
We present DeAR, a novel debiasing method that learns additive residual image representations to offset the original representations.
arXiv Detail & Related papers (2023-03-18T14:57:43Z)
- Auditing Gender Presentation Differences in Text-to-Image Models [54.16959473093973]
We study how gender is presented differently in text-to-image models.
By probing gender indicators in the input text, we quantify the frequency differences of presentation-centric attributes.
We propose an automatic method to estimate such differences.
arXiv Detail & Related papers (2023-02-07T18:52:22Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks than vision-only models.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above (including all listed details) and is not responsible for any consequences of its use.