Crowdsourcing Dermatology Images with Google Search Ads: Creating a
Real-World Skin Condition Dataset
- URL: http://arxiv.org/abs/2402.18545v1
- Date: Wed, 28 Feb 2024 18:29:07 GMT
- Title: Crowdsourcing Dermatology Images with Google Search Ads: Creating a
Real-World Skin Condition Dataset
- Authors: Abbi Ward, Jimmy Li, Julie Wang, Sriram Lakshminarasimhan, Ashley
Carrick, Bilson Campana, Jay Hartford, Pradeep Kumar S, Tiya
Tiyasirichokchai, Sunny Virmani, Renee Wong, Yossi Matias, Greg S. Corrado,
Dale R. Webster, Dawn Siegel, Steven Lin, Justin Ko, Alan Karthikesalingam,
Christopher Semturs and Pooja Rao
- Abstract summary: This dataset contains 10,408 images from 5,033 contributions from internet users in the United States over 8 months starting March 2023.
Female (66.72%) and younger (52% < age 40) contributors had a higher representation in the dataset compared to the US population.
Dermatologist confidence in assigning a differential diagnosis increased with the number of available variables.
- Score: 7.60704693971239
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Health datasets from clinical sources do not reflect the breadth
and diversity of disease in the real world, impacting research, medical
education, and artificial intelligence (AI) tool development. Dermatology is a
suitable area to develop and test a new and scalable method to create
representative health datasets.
Methods: We used Google Search advertisements to invite contributions to an
open access dataset of images of dermatology conditions, demographic and
symptom information. With informed contributor consent, we describe and release
this dataset containing 10,408 images from 5,033 contributions from internet
users in the United States over 8 months starting March 2023. The dataset
includes dermatologist condition labels as well as estimated Fitzpatrick Skin
Type (eFST) and Monk Skin Tone (eMST) labels for the images.
Results: We received a median of 22 submissions/day (IQR 14-30). Female
(66.72%) and younger (52% < age 40) contributors had a higher representation in
the dataset compared to the US population, and 32.6% of contributors reported a
non-White racial or ethnic identity. Over 97.5% of contributions were genuine
images of skin conditions. Dermatologist confidence in assigning a differential
diagnosis increased with the number of available variables, and showed a weaker
correlation with image sharpness (Spearman's P values <0.001 and 0.01
respectively). Most contributions were short-duration (54% with onset < 7 days
ago) and 89% were allergic, infectious, or inflammatory conditions. eFST and
eMST distributions reflected the geographical origin of the dataset. The
dataset is available at github.com/google-research-datasets/scin.
Conclusion: Search ads are effective at crowdsourcing images of health
conditions. The SCIN dataset bridges important gaps in the availability of
representative images of common skin conditions.
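
The Results cite Spearman's rank correlation when relating dermatologist confidence to the number of available variables and to image sharpness, but report only P values. As a rough illustration of how such a coefficient is computed, the sketch below ranks both variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The function names and the toy numbers are hypothetical and are not taken from the SCIN dataset or the authors' analysis code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy data (NOT from the paper): the number of variables
# available for a contribution and an ordinal confidence rating.
n_variables = [1, 2, 2, 3, 4, 5, 5, 6]
confidence = [1, 2, 1, 3, 3, 4, 5, 5]
rho = spearman_rho(n_variables, confidence)
print(f"Spearman's rho on the toy data: {rho:.3f}")
```

A positive rho means that confidence tends to rise with the number of available variables. A P value like those quoted in the abstract would additionally require a significance test on rho, which this sketch does not perform.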
Related papers
- PASSION for Dermatology: Bridging the Diversity Gap with Pigmented Skin Images from Sub-Saharan Africa [29.405369900938393]
Africa faces a huge shortage of dermatologists, with less than one per million people.
This is in stark contrast to the high demand for dermatologic care, with 80% of the paediatric population suffering from largely untreated skin conditions.
The PASSION project aims to address this issue by collecting images of skin diseases in Sub-Saharan countries with the aim of open-sourcing this data.
arXiv Detail & Related papers (2024-11-07T10:11:37Z)
- FairSkin: Fair Diffusion for Skin Disease Image Generation [54.29840149709033]
Diffusion Model (DM) has become a leading method in generating synthetic medical images, but it suffers from a critical twofold bias.
We propose FairSkin, a novel DM framework that mitigates these biases through a three-level resampling mechanism.
Our approach significantly improves the diversity and quality of generated images, contributing to more equitable skin disease detection in clinical settings.
arXiv Detail & Related papers (2024-10-29T21:37:03Z)
- A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds [49.34500499203579]
We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics.
We generate high-quality synthetic fMRI data based on user-supplied demographics.
arXiv Detail & Related papers (2024-05-13T17:49:20Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- The Development and Performance of a Machine Learning Based Mobile Platform for Visually Determining the Etiology of Penile Pathology [0.0]
We developed a machine-learning model for classifying five penile diseases.
That model is currently in use globally and has the potential to improve access to diagnostic services for penile diseases.
arXiv Detail & Related papers (2024-03-13T11:05:40Z)
- DDI-CoCo: A Dataset For Understanding The Effect Of Color Contrast In Machine-Assisted Skin Disease Detection [51.92255321684027]
We study the interaction between skin tone and color difference effects and suggest that color difference can be an additional reason behind model performance bias between skin tones.
Our work provides a complementary angle to dermatology AI for improving skin disease detection.
arXiv Detail & Related papers (2024-01-24T07:45:24Z)
- On the notion of Hallucinations from the lens of Bias and Validity in Synthetic CXR Images [0.35998666903987897]
Generative models, such as diffusion models, aim to mitigate data quality and clinical information disparities.
At Stanford, researchers explored the utility of a fine-tuned Stable Diffusion model (RoentGen) for medical imaging data augmentation.
We leveraged RoentGen to produce synthetic Chest-XRay (CXR) images and conducted assessments on bias, validity, and hallucinations.
arXiv Detail & Related papers (2023-12-12T04:41:20Z)
- Generative models improve fairness of medical classifiers under distribution shifts [49.10233060774818]
We show that learning realistic augmentations automatically from data is possible in a label-efficient manner using generative models.
We demonstrate that these learned augmentations can surpass ones by making models more robust and statistically fair in- and out-of-distribution.
arXiv Detail & Related papers (2023-04-18T18:15:38Z)
- The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.5M Screening and Diagnostic Mammograms [2.243792799100692]
The EMory BrEast imaging dataset contains 3650,000 2D and diagnostic mammograms for 116,000 women divided equally between White and African American patients.
Our goal is to share this dataset with research partners to aid in development and validation of breast AI models that will serve all patients fairly and help decrease bias in medical AI.
arXiv Detail & Related papers (2022-02-08T14:40:59Z)
- Disparities in Dermatology AI: Assessments Using Diverse Clinical Images [9.767299882513825]
We show that state-of-the-art dermatology AI models perform substantially worse on Diverse Dermatology Images dataset.
We find that dark skin tones and uncommon diseases, which are well represented in the DDI dataset, lead to performance drop-offs.
arXiv Detail & Related papers (2021-11-15T07:04:58Z)
- Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability [54.86482395312936]
Deep learning (DL) classification models were trained to identify COVID-19-positive patients on 3D computed tomography (CT) datasets from different countries.
We trained nine identical DL-based classification models by using combinations of the datasets with a 72% train, 8% validation, and 20% test data split.
The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better.
arXiv Detail & Related papers (2021-02-18T21:14:52Z)