Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets
- URL: http://arxiv.org/abs/2401.14497v2
- Date: Mon, 03 Feb 2025 04:30:57 GMT
- Title: Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets
- Authors: Kumar Abhishek, Aditi Jain, Ghassan Hamarneh,
- Abstract summary: Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition.
We conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k.
- Score: 17.01966057343415
- License:
- Abstract: The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k, uncovering these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis, by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
Related papers
- Human Body Restoration with One-Step Diffusion Model and A New Benchmark [74.66514054623669]
We propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline.
This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images.
We also propose emphOSDHuman, a novel one-step diffusion model for human body restoration.
arXiv Detail & Related papers (2025-02-03T14:48:40Z) - Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering [13.17302533571231]
Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
arXiv Detail & Related papers (2024-11-19T00:28:20Z) - A Guide to Misinformation Detection Datasets [5.673951146506489]
This guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations.
All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
arXiv Detail & Related papers (2024-11-07T18:47:39Z) - Is Dataset Quality Still a Concern in Diagnosis Using Large Foundation Model? [33.71784955496207]
An LFM has been developed for fundus images using the Vision Transformer (VIT) and a self-supervised learning framework.
To investigate the influence of data quality on LFM, we conducted explorations in two fundus diagnosis tasks using datasets of varying quality.
Our investigation found that LFM exhibits greater resilience to dataset quality issues, including image quality and dataset bias, compared to typical convolutional networks.
arXiv Detail & Related papers (2024-05-21T08:27:35Z) - Copycats: the many lives of a publicly available medical imaging dataset [12.98380178359767]
Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare.
MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace.
While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets.
arXiv Detail & Related papers (2024-02-09T12:01:22Z) - Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z) - Genetic InfoMax: Exploring Mutual Information Maximization in
High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z) - Statistical Learning to Operationalize a Domain Agnostic Data Quality
Scoring [8.864453148536061]
The research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label.
The results of this study would be useful to data scientists as the value of this quality label would instill confidence before deploying the data for his/her respective practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z) - Fader Networks for domain adaptation on fMRI: ABIDE-II study [68.5481471934606]
We use 3D convolutional autoencoders to build the domain irrelevant latent space image representation and demonstrate this method to outperform existing approaches on ABIDE data.
arXiv Detail & Related papers (2020-10-14T16:50:50Z) - Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises the challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z) - GraspNet: A Large-Scale Clustered and Densely Annotated Dataset for
Object Grasping [49.777649953381676]
We contribute a large-scale grasp pose detection dataset with a unified evaluation system.
Our dataset contains 87,040 RGBD images with over 370 million grasp poses.
arXiv Detail & Related papers (2019-12-31T18:15:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.