Investigating the Quality of DermaMNIST and Fitzpatrick17k
Dermatological Image Datasets
- URL: http://arxiv.org/abs/2401.14497v1
- Date: Thu, 25 Jan 2024 20:29:01 GMT
- Title: Investigating the Quality of DermaMNIST and Fitzpatrick17k
Dermatological Image Datasets
- Authors: Kumar Abhishek, Aditi Jain, Ghassan Hamarneh
- Abstract summary: We conduct meticulous analyses of two popular dermatological image datasets: DermaMNIST and Fitzpatrick17k.
We uncover data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets.
- Score: 19.128392861461297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable progress of deep learning in dermatological tasks has brought
us closer to achieving diagnostic accuracies comparable to those of human
experts. However, while large datasets play a crucial role in the development
of reliable deep neural network models, the quality of data therein and their
correct usage are of paramount importance. Several factors can impact data
quality, such as the presence of duplicates, data leakage across train-test
partitions, mislabeled images, and the absence of a well-defined test
partition. In this paper, we conduct meticulous analyses of two popular
dermatological image datasets: DermaMNIST and Fitzpatrick17k, uncovering these
data quality issues, measure the effects of these problems on the benchmark
results, and propose corrections to the datasets. Besides ensuring the
reproducibility of our analysis, by making our analysis pipeline and the
accompanying code publicly available, we aim to encourage similar explorations
and to facilitate the identification and addressing of potential data quality
issues in other large datasets.
Related papers
- A Guide to Misinformation Detection Datasets [5.673951146506489]
This guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations.
All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
arXiv Detail & Related papers (2024-11-07T18:47:39Z) - Is Dataset Quality Still a Concern in Diagnosis Using Large Foundation Model? [33.71784955496207]
An LFM has been developed for fundus images using the Vision Transformer (VIT) and a self-supervised learning framework.
To investigate the influence of data quality on LFM, we conducted explorations in two fundus diagnosis tasks using datasets of varying quality.
Our investigation found that LFM exhibits greater resilience to dataset quality issues, including image quality and dataset bias, compared to typical convolutional networks.
arXiv Detail & Related papers (2024-05-21T08:27:35Z) - Copycats: the many lives of a publicly available medical imaging dataset [12.98380178359767]
Medical Imaging (MI) datasets are fundamental to artificial intelligence in healthcare.
MI datasets used to be proprietary, but have become increasingly available to the public, including on community-contributed platforms (CCPs) like Kaggle or HuggingFace.
While open data is important to enhance the redistribution of data's public value, we find that the current CCP governance model fails to uphold the quality needed and recommended practices for sharing, documenting, and evaluating datasets.
arXiv Detail & Related papers (2024-02-09T12:01:22Z) - Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z) - Exploring Dataset-Scale Indicators of Data Quality [23.017200605976807]
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
Recent research has suggested that improving data quality can significantly reduce the need for data quantity.
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents.
arXiv Detail & Related papers (2023-11-07T14:14:32Z) - Analyzing the Effects of Handling Data Imbalance on Learned Features
from Medical Images by Looking Into the Models [50.537859423741644]
Training a model on an imbalanced dataset can introduce unique challenges to the learning problem.
We look deeper into the internal units of neural networks to observe how handling data imbalance affects the learned features.
arXiv Detail & Related papers (2022-04-04T09:38:38Z) - Statistical Learning to Operationalize a Domain Agnostic Data Quality
Scoring [8.864453148536061]
The research study provides an automated platform which takes an incoming dataset and metadata to provide the DQ score, report and label.
The results of this study would be useful to data scientists as the value of this quality label would instill confidence before deploying the data for his/her respective practical application.
arXiv Detail & Related papers (2021-08-16T12:20:57Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Fader Networks for domain adaptation on fMRI: ABIDE-II study [68.5481471934606]
We use 3D convolutional autoencoders to build the domain irrelevant latent space image representation and demonstrate this method to outperform existing approaches on ABIDE data.
arXiv Detail & Related papers (2020-10-14T16:50:50Z) - Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises the challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z) - Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.