The Impact of Skin Tone Label Granularity on the Performance and Fairness of AI Based Dermatology Image Classification Models
- URL: http://arxiv.org/abs/2509.11184v1
- Date: Sun, 14 Sep 2025 09:30:24 GMT
- Title: The Impact of Skin Tone Label Granularity on the Performance and Fairness of AI Based Dermatology Image Classification Models
- Authors: Partha Shah, Durva Sankhe, Maariyah Rashid, Zakaa Khaled, Esther Puyol-Antón, Tiarna Lee, Maram Alqarni, Sweta Rai, Andrew P. King
- Abstract summary: The Fitzpatrick Skin Tone (FST) scale has been criticised for having greater granularity in its skin tone categories for lighter-skinned subjects. This paper conducts an investigation of the impact (on performance and bias) on AI classification models of granularity in the FST scale.
- Score: 0.8590210863443347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) models to automatically classify skin lesions from dermatology images have shown promising performance but also susceptibility to bias by skin tone. The most common way of representing skin tone information is the Fitzpatrick Skin Tone (FST) scale. The FST scale has been criticised for having greater granularity in its skin tone categories for lighter-skinned subjects. This paper conducts an investigation of the impact (on performance and bias) on AI classification models of granularity in the FST scale. By training multiple AI models to classify benign vs. malignant lesions using FST-specific data of differing granularity, we show that: (i) when training models using FST-specific data based on three groups (FST 1/2, 3/4 and 5/6), performance is generally better for models trained on FST-specific data compared to a general model trained on FST-balanced data; (ii) reducing the granularity of FST scale information (from 1/2 and 3/4 to 1/2/3/4) can have a detrimental effect on performance. Our results highlight the importance of the granularity of FST groups when training lesion classification models. Given the question marks over possible human biases in the choice of categories in the FST scale, this paper provides evidence for a move away from the FST scale in fair AI research and a transition to an alternative scale that better represents the diversity of human skin tones.
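The grouping step described in the abstract can be illustrated with a minimal sketch. It only shows how per-image FST labels might be partitioned into training subsets at two granularities (1/2, 3/4, 5/6 versus 1/2/3/4, 5/6). The scheme names, the `split_by_group` helper, and the `(image_id, fst)` sample format are assumptions made for illustration; they are not taken from the paper, and no model training is shown.

```python
from collections import defaultdict

# Hypothetical grouping schemes (names are illustrative, not from the paper).
# Each maps an FST type (1-6) to the label of the group it is trained with.
GROUPINGS = {
    # Three groups: FST 1/2, 3/4 and 5/6.
    "three_groups": {1: "FST 1/2", 2: "FST 1/2",
                     3: "FST 3/4", 4: "FST 3/4",
                     5: "FST 5/6", 6: "FST 5/6"},
    # Reduced granularity: FST 1/2 and 3/4 merged into 1/2/3/4.
    "merged_light": {1: "FST 1/2/3/4", 2: "FST 1/2/3/4",
                     3: "FST 1/2/3/4", 4: "FST 1/2/3/4",
                     5: "FST 5/6", 6: "FST 5/6"},
}


def split_by_group(samples, scheme):
    """Partition (image_id, fst) pairs into per-group lists of image ids.

    `samples` is an assumed format: an iterable of (image_id, fst) tuples,
    where fst is an integer from 1 to 6.
    """
    mapping = GROUPINGS[scheme]
    subsets = defaultdict(list)
    for image_id, fst in samples:
        subsets[mapping[fst]].append(image_id)
    return dict(subsets)


if __name__ == "__main__":
    # Toy data, invented purely for demonstration.
    demo = [("img_a", 1), ("img_b", 3), ("img_c", 6)]
    for scheme in GROUPINGS:
        print(scheme, split_by_group(demo, scheme))
```

Under these assumptions, one model would be trained per subset returned by `split_by_group`, and the two schemes compared on the same held-out images.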
Related papers
- Small-to-Large Generalization: Data Influences Models Consistently Across Scale [76.87199303408161]
We find that small- and large-scale language model predictions (generally) do highly correlate across choice of training data. We also characterize how proxy scale affects effectiveness in two downstream proxy model applications: data attribution and dataset selection.
arXiv Detail & Related papers (2025-05-22T05:50:19Z) - Consensus and Subjectivity of Skin Tone Annotation for ML Fairness [1.0728297108232812]
We release the Monk Skin Tone Examples (MST-E) dataset, containing 1515 images and 31 videos spread across the full MST scale.
Our study shows that annotators can reliably annotate skin tone in a way that aligns with an expert in the MST scale, even under challenging environmental conditions.
We advise practitioners to use a diverse set of annotators and a higher replication count for each image when annotating skin tone for fairness research.
arXiv Detail & Related papers (2023-05-16T00:03:09Z) - Analyzing Bias in Diffusion-based Face Generation Models [75.80072686374564]
Diffusion models are increasingly popular in synthetic data generation and image editing applications.
We investigate the presence of bias in diffusion-based face generation models with respect to attributes such as gender, race, and age.
We examine how dataset size affects the attribute composition and perceptual quality of both diffusion and Generative Adversarial Network (GAN) based face generation models.
arXiv Detail & Related papers (2023-05-10T18:22:31Z) - Interpretable Classification of Early Stage Parkinson's Disease from EEG [0.6597195879147557]
This paper introduces a novel approach to detecting Parkinson's Disease in its early stages using EEG data.
The hypothesis is that this representation captures essential information from the noisy EEG signal, improving disease detection.
Statistical features extracted from this representation are utilised as input for interpretable machine learning models.
In future, these models could be deployed in the real world: the results presented in this paper indicate that more than 3 in 4 early-stage Parkinson's cases would be captured with our pipeline.
arXiv Detail & Related papers (2023-01-20T16:11:02Z) - FairDisCo: Fairer AI in Dermatology via Disentanglement Contrastive Learning [11.883809920936619]
We propose FairDisCo, a disentanglement deep learning framework with contrastive learning.
We compare FairDisCo to three fairness methods, namely, resampling, reweighting, and attribute-aware.
We adapt two fairness-based metrics DPM and EOM for our multiple classes and sensitive attributes task, highlighting the skin-type bias in skin lesion classification.
arXiv Detail & Related papers (2022-08-22T01:54:23Z) - SuperCon: Supervised Contrastive Learning for Imbalanced Skin Lesion Classification [9.265557367859637]
SuperCon is a two-stage training strategy to overcome the class imbalance problem on skin lesion classification.
Our two-stage training strategy effectively addresses the class imbalance classification problem, and significantly improves existing works in terms of F1-score and AUC score.
arXiv Detail & Related papers (2022-02-11T15:19:36Z) - FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes [51.02407217197623]
We propose a two-stage training algorithm named FAIRIF.
It minimizes the loss over the reweighted data set where the sample weights are computed.
We show that FAIRIF yields models with better fairness-utility trade-offs against various types of bias.
arXiv Detail & Related papers (2022-01-15T05:14:48Z) - Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics [0.0]
We show that measures of skin-tone for biometric performance evaluations must come from objective, characterized, and controlled sources.
arXiv Detail & Related papers (2021-06-18T16:12:24Z) - From Sound Representation to Model Robustness [82.21746840893658]
We investigate the impact of different standard environmental sound representations (spectrograms) on the recognition performance and adversarial attack robustness of a victim residual convolutional neural network.
Averaged over various experiments on three environmental sound datasets, we found the ResNet-18 model outperforms other deep learning architectures.
arXiv Detail & Related papers (2020-07-27T17:30:49Z) - Alleviating the Incompatibility between Cross Entropy Loss and Episode Training for Few-shot Skin Disease Classification [76.89093364969253]
We propose to apply Few-Shot Learning to skin disease identification to address the extreme scarcity of training sample problem.
Based on a detailed analysis, we propose the Query-Relative (QR) loss, which proves superior to Cross Entropy (CE) under episode training.
We further strengthen the proposed QR loss with a novel adaptive hard margin strategy.
arXiv Detail & Related papers (2020-04-21T00:57:11Z) - Adversarial Feature Hallucination Networks for Few-Shot Learning [84.31660118264514]
Adversarial Feature Hallucination Networks (AFHN) is based on conditional Wasserstein Generative Adversarial networks (cWGAN).
Two novel regularizers are incorporated into AFHN to encourage discriminability and diversity of the synthesized features.
arXiv Detail & Related papers (2020-03-30T02:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.