Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study
- URL: http://arxiv.org/abs/2510.00055v2
- Date: Tue, 07 Oct 2025 09:41:10 GMT
- Title: Adapting Large Language Models to Mitigate Skin Tone Biases in Clinical Dermatology Tasks: A Mixed-Methods Study
- Authors: Kiran Nijjer, Ryan Bui, Derek Jiu, Adnan Ahmed, Peter Wang, Kevin Zhu, Lilly Zhu
- Abstract summary: We evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases. We leveraged the SkinGPT-4 backbone to develop fine-tuned models for custom skin disease classification tasks.
- Score: 2.3034630097498883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SkinGPT-4, a large vision-language model, leverages annotated skin disease images to augment clinical workflows in underserved communities. However, its training dataset predominantly represents lighter skin tones, limiting diagnostic accuracy for darker tones. Here, we evaluated performance biases in SkinGPT-4 across skin tones on common skin diseases, including eczema, allergic contact dermatitis, and psoriasis, using the open-source SCIN dataset. We leveraged the SkinGPT-4 backbone to develop fine-tuned models for custom skin disease classification tasks and explored bias mitigation strategies. Clinical evaluation by board-certified dermatologists on six relevant skin diseases from 300 SCIN cases assessed images for diagnostic accuracy, informativity, physician utility, and patient utility. Model fairness metrics, including demographic parity and equalized odds, were calculated across skin tones. SkinGPT-4 achieved an average demographic parity of 0.10 across Fitzpatrick types, with notable differences of 0.10-0.15 between the lightest and darkest tones across evaluation metrics. Model hallucinations involving artifacts and anatomy occurred at a rate of 17.8. Our customized models achieved average F1, precision, and AUROC of 0.75, 0.78, and 0.78 across visually similar disease pairs. Fairness analysis showed an average demographic parity of 0.75, with a maximum disparity of 0.21 across skin tones. The best model achieved parity scores of 0.83, 0.83, 0.76, 0.89, 0.90, and 0.90 for Fitzpatrick types I-VI, indicating robust fairness. Large language models such as SkinGPT-4 showed weaker performance on darker tones. Model biases exist across evaluation criteria, and hallucinations may affect diagnostic efficacy. These findings demonstrate the feasibility of training accurate, fair models for custom skin disease classification using existing backbones.
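The fairness metrics reported in the abstract (demographic parity and equalized odds) can be computed directly from model predictions and skin-tone group labels. A minimal sketch, not the authors' code, assuming binary predictions and integer Fitzpatrick group labels as NumPy arrays, and reporting each metric as a worst-case gap across groups:

```python
import numpy as np

def demographic_parity_gap(y_pred, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    rates = [y_pred[groups == g].mean() for g in np.unique(groups)]
    return max(rates) - min(rates)

def equalized_odds_gap(y_true, y_pred, groups):
    """Largest across-group gap in true-positive rate or false-positive rate."""
    tprs, fprs = [], []
    for g in np.unique(groups):
        mask = groups == g
        t, p = y_true[mask], y_pred[mask]
        tprs.append(p[t == 1].mean() if (t == 1).any() else np.nan)
        fprs.append(p[t == 0].mean() if (t == 0).any() else np.nan)
    return max(np.nanmax(tprs) - np.nanmin(tprs),
               np.nanmax(fprs) - np.nanmin(fprs))
```

Note that papers variously report parity as a gap (smaller is fairer) or as a min/max ratio (closer to 1 is fairer); the abstract's figures of 0.10 and 0.75 suggest both conventions appear, so check the paper's definitions before comparing numbers.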
Related papers
- Colorimeter-Supervised Skin Tone Estimation from Dermatoscopic Images for Fairness Auditing [0.0]
We develop neural networks that predict Fitzpatrick skin type via ordinal regression and the Individual Typology Angle (ITA) via color regression. We release code and pretrained models as an open-source tool for rapid skin-tone annotation and bias auditing. This is, to our knowledge, the first dermatoscopic skin-tone estimation neural network validated against colorimeter measurements.
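The ITA regression target above has a closed-form definition in CIELAB color space: ITA = arctan((L* - 50) / b*) x 180 / pi. A minimal sketch of computing it and binning into skin-tone categories; the thresholds follow the commonly cited Chardon scale, not necessarily the ones used in this paper:

```python
import math

def ita_degrees(L, b):
    """Individual Typology Angle from CIELAB lightness L* and yellow-blue b*."""
    return math.degrees(math.atan((L - 50.0) / b))

def ita_category(ita):
    """Map an ITA value (degrees) to a skin-tone category (Chardon-style cuts)."""
    if ita > 55:  return "very light"
    if ita > 41:  return "light"
    if ita > 28:  return "intermediate"
    if ita > 10:  return "tan"
    if ita > -30: return "brown"
    return "dark"
```

Higher ITA corresponds to lighter skin; note the formula is undefined at b* = 0, which real pipelines guard against.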
arXiv Detail & Related papers (2026-02-10T20:20:45Z) - Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries [65.4286079244589]
Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLoS using admission-level patient and hospital administrative data.
arXiv Detail & Related papers (2026-01-07T23:35:24Z) - TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone [4.847470451539328]
This study compares two objective skin tone classification methods: the widely used Individual Typology Angle (ITA) and a perceptually grounded alternative based on Lightness ($L*$) and Hue ($H*$). Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method.
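The $L*$/$H*$ alternative mentioned above replaces the single ITA axis with lightness and the CIELAB hue angle, H* = atan2(b*, a*). A toy sketch with illustrative cut points (the thresholds are placeholders, not the paper's):

```python
import math

def hue_angle(a, b):
    """CIELAB hue angle H* in degrees (0-360), from chroma axes a* and b*."""
    return math.degrees(math.atan2(b, a)) % 360.0

def lh_group(L, a, b, l_cut=60.0, h_cut=55.0):
    """Toy two-axis grouping: one lightness cut and one hue cut.
    Cut values are illustrative only."""
    tone = "light" if L >= l_cut else "dark"
    hue = "yellow-leaning" if hue_angle(a, b) >= h_cut else "red-leaning"
    return tone, hue
```

Separating lightness from hue avoids ITA's collapse of both dimensions into one angle, which is the perceptual motivation the paper cites.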
arXiv Detail & Related papers (2025-05-27T02:31:08Z) - Are generative models fair? A study of racial bias in dermatological image generation [15.812312064457865]
We evaluate the fairness of generative models in clinical dermatology with respect to racial bias. We utilize the Fitzpatrick17k dataset to examine how racial bias influences the representation and performance of these models.
arXiv Detail & Related papers (2025-01-20T21:24:15Z) - FairSkin: Fair Diffusion for Skin Disease Image Generation [54.29840149709033]
Diffusion Model (DM) has become a leading method in generating synthetic medical images, but it suffers from a critical twofold bias.
We propose FairSkin, a novel DM framework that mitigates these biases through a three-level resampling mechanism.
Our approach significantly improves the diversity and quality of generated images, contributing to more equitable skin disease detection in clinical settings.
arXiv Detail & Related papers (2024-10-29T21:37:03Z) - Evaluating Machine Learning-based Skin Cancer Diagnosis [0.0]
The research assesses two convolutional neural network architectures: a MobileNet-based model and a custom CNN model.
Both models are evaluated for their ability to classify skin lesions into seven categories and to distinguish between dangerous and benign lesions.
The study concludes that while the models show promise in explainability, further development is needed to ensure fairness across different skin tones.
arXiv Detail & Related papers (2024-09-04T02:44:48Z) - Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z) - DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection [51.92255321684027]
We study the interaction between skin tone and color difference effects and suggest that color difference can be an additional reason behind model performance bias between skin tones.
Our work provides a complementary angle to dermatology AI for improving skin disease detection.
arXiv Detail & Related papers (2024-01-24T07:45:24Z) - How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers? [49.35105290167996]
Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance.
This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification.
arXiv Detail & Related papers (2023-08-17T20:40:30Z) - Generative models improve fairness of medical classifiers under distribution shifts [49.10233060774818]
We show that learning realistic augmentations automatically from data is possible in a label-efficient manner using generative models.
We demonstrate that these learned augmentations can surpass heuristic ones, making models more robust and statistically fair both in- and out-of-distribution.
arXiv Detail & Related papers (2023-04-18T18:15:38Z) - EdgeMixup: Improving Fairness for Skin Disease Classification and Segmentation [9.750368551427494]
Skin lesions can be an early indicator of a wide range of infectious and other diseases.
The use of deep learning (DL) models to diagnose skin lesions has great potential in assisting clinicians with prescreening patients.
These models often learn biases inherent in training data, which can lead to a performance gap in the diagnosis of people with light and/or dark skin tones.
arXiv Detail & Related papers (2022-02-28T15:33:31Z) - Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability [54.86482395312936]
Deep learning (DL) classification models were trained to identify COVID-19-positive patients on 3D computed tomography (CT) datasets from different countries.
We trained nine identical DL-based classification models by using combinations of the datasets with a 72% train, 8% validation, and 20% test data split.
The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better.
arXiv Detail & Related papers (2021-02-18T21:14:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.