Consensus and Subjectivity of Skin Tone Annotation for ML Fairness
- URL: http://arxiv.org/abs/2305.09073v3
- Date: Tue, 2 Jan 2024 20:53:03 GMT
- Title: Consensus and Subjectivity of Skin Tone Annotation for ML Fairness
- Authors: Candice Schumann, Gbolahan O. Olanubi, Auriel Wright, Ellis Monk Jr.,
Courtney Heldreth, Susanna Ricco
- Abstract summary: We release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale.
Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions.
We advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.
- Score: 1.0728297108232812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding different human attributes and how they affect model behavior
may become a standard need for all model creation and usage, from traditional
computer vision tasks to the newest multimodal generative AI systems. In
computer vision specifically, we have relied on datasets augmented with
perceived attribute signals (e.g., gender presentation, skin tone, and age) and
benchmarks enabled by these datasets. Typically, labels for these tasks come
from human annotators. However, annotating attribute signals, especially skin
tone, is a difficult and subjective task. Perceived skin tone is affected by
technical factors, like lighting conditions, and social factors that shape an
annotator's lived experience. This paper examines the subjectivity of skin tone
annotation through a series of annotation experiments using the Monk Skin Tone
(MST) scale, a small pool of professional photographers, and a much larger pool
of trained crowdsourced annotators. Along with this study we release the Monk
Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread
across the full MST scale. MST-E is designed to help train human annotators to
annotate MST effectively. Our study shows that annotators can reliably annotate
skin tone in a way that aligns with an expert in the MST scale, even under
challenging environmental conditions. We also find evidence that annotators
from different geographic regions rely on different mental models of MST
categories, resulting in annotations that systematically vary across regions.
Given this, we advise practitioners to use a diverse set of annotators and a
higher replication count for each image when annotating skin tone for fairness
research.
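The recommendation above (diverse annotators, higher replication per image) lends itself to a concrete aggregation step. Below is a minimal Python sketch, not from the paper: it assumes hypothetical annotation records of (image_id, annotator_region, MST value) and shows one way to form a per-image consensus label while flagging under-replicated images. The median aggregator, the min_replications=3 threshold, and the region names are illustrative assumptions, not the authors' protocol.

```python
from collections import defaultdict
from statistics import median

# Hypothetical annotation records: (image_id, annotator_region, mst_value),
# where mst_value is a Monk Skin Tone point from 1 (lightest) to 10 (darkest).
annotations = [
    ("img_001", "north_america", 4), ("img_001", "india", 5),
    ("img_001", "brazil", 4), ("img_001", "india", 4),
    ("img_002", "north_america", 8), ("img_002", "brazil", 9),
]

def aggregate_mst(records, min_replications=3):
    """Collapse replicated MST annotations into one consensus label per image.

    Images with fewer than min_replications annotations are flagged for more
    labeling instead of being aggregated, reflecting the paper's advice to use
    a higher replication count per image. (Threshold is an assumption.)
    """
    by_image = defaultdict(list)
    for image_id, region, mst in records:
        by_image[image_id].append((region, mst))

    consensus, needs_more = {}, []
    for image_id, labels in by_image.items():
        if len(labels) < min_replications:
            needs_more.append(image_id)
            continue
        values = [mst for _, mst in labels]
        consensus[image_id] = {
            "mst": median(values),                # robust to a single outlier annotator
            "spread": max(values) - min(values),  # wide spread flags a hard/subjective image
            "regions": len({r for r, _ in labels}),  # diversity check on annotator pools
        }
    return consensus, needs_more

consensus, needs_more = aggregate_mst(annotations)
print(consensus)   # img_001: consensus MST 4.0 from 4 annotations across 3 regions
print(needs_more)  # ['img_002'] -- only 2 annotations, collect more
```

In practice, the spread and region counts could feed a routing rule: images with high disagreement, or with annotations from a single region, could be sent back for additional, more geographically diverse annotation.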
Related papers
- Are generative models fair? A study of racial bias in dermatological image generation [15.812312064457865]
We evaluate the fairness of generative models in clinical dermatology with respect to racial bias.
We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models.
arXiv Detail & Related papers (2025-01-20T21:24:15Z)
- Colorimetric skin tone scale for improved accuracy and reduced perceptual bias of human skin tone annotations [0.0]
We develop a novel Colorimetric Skin Tone (CST) scale based on prior colorimetric measurements.
Using experiments requiring humans to rate their own skin tone and the skin tone of subjects in images, we show that the new CST scale is more sensitive, consistent, and colorimetrically accurate.
arXiv Detail & Related papers (2024-10-28T13:29:24Z)
- A Multimodal Automated Interpretability Agent [63.8551718480664]
MAIA is a system that uses neural models to automate neural model understanding tasks.
We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images.
We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
arXiv Detail & Related papers (2024-04-22T17:55:11Z)
- DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection [51.92255321684027]
We study the interaction between skin tone and color difference effects and suggest that color difference can be an additional reason behind model performance bias between skin tones.
Our work provides a complementary angle to dermatology AI for improving skin disease detection.
arXiv Detail & Related papers (2024-01-24T07:45:24Z)
- Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [81.20804369985376]
We conduct a large-scale subjective experiment collecting a large volume of real human feedback on low-level vision.
The constructed **Q-Pathway** dataset includes 58K detailed human feedback entries on 18,973 images.
We design a GPT-assisted conversion to process this feedback into 200K instruction-response pairs in diverse formats.
arXiv Detail & Related papers (2023-11-12T09:10:51Z)
- FACET: Fairness in Computer Vision Evaluation Benchmark [21.862644380063756]
Computer vision models have known performance disparities across attributes such as gender and skin tone.
We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion).
FACET is a large, publicly available evaluation set of 32k images for some of the most common vision tasks.
arXiv Detail & Related papers (2023-08-31T17:59:48Z)
- DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models [73.12069620086311]
We investigate the visual reasoning capabilities and social biases of text-to-image models.
First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding.
Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images.
arXiv Detail & Related papers (2022-02-08T18:36:52Z)
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We introduce RIVAL10, a dataset consisting of roughly 26k instances over 10 classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics [0.0]
We show that measures of skin tone for biometric performance evaluations must come from objective, characterized, and controlled sources.
arXiv Detail & Related papers (2021-06-18T16:12:24Z)
- Towards measuring fairness in AI: the Casual Conversations dataset [9.246092246471955]
Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person.
The videos were recorded in multiple U.S. states with a diverse set of adults in various age, gender and apparent skin tone groups.
arXiv Detail & Related papers (2021-04-06T22:48:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.