Towards Reliable Dermatology Evaluation Benchmarks
- URL: http://arxiv.org/abs/2309.06961v2
- Date: Sat, 16 Dec 2023 06:14:00 GMT
- Title: Towards Reliable Dermatology Evaluation Benchmarks
- Authors: Fabian Gröger, Simone Lionetti, Philippe Gottfrois, Alvaro
Gonzalez-Jimenez, Matthew Groh, Roxana Daneshjou, Labelling Consortium,
Alexander A. Navarini, Marc Pouly
- Abstract summary: Benchmark datasets for digital dermatology unwittingly contain inaccuracies that reduce trust in model performance estimates.
We propose a resource-efficient data-cleaning protocol to identify issues that escaped previous curation.
- Score: 37.464923424849964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benchmark datasets for digital dermatology unwittingly contain inaccuracies
that reduce trust in model performance estimates. We propose a
resource-efficient data-cleaning protocol to identify issues that escaped
previous curation. The protocol leverages an existing algorithmic cleaning
strategy and is followed by a confirmation process terminated by an intuitive
stopping criterion. Based on confirmation by multiple dermatologists, we remove
irrelevant samples and near duplicates and estimate the percentage of label
errors in six dermatology image datasets for model evaluation promoted by the
International Skin Imaging Collaboration. Along with this paper, we publish
revised file lists for each dataset which should be used for model evaluation.
Our work paves the way for more trustworthy performance assessment in digital
dermatology.
Related papers
- GRASP-PsONet: Gradient-based Removal of Spurious Patterns for PsOriasis Severity Classification [0.0]
We propose a framework to automatically flag problematic training images that introduce spurious correlations.
Removing 8.2% of flagged images improves model AUC-ROC by 5% (85% to 90%) on a held-out test set.
When applied to a subset of training data rated by two dermatologists, the method identifies over 90% of cases with inter-rater disagreement.
arXiv Detail & Related papers (2025-06-27T03:42:09Z)
- Benchmarking Robustness of Contrastive Learning Models for Medical Image-Report Retrieval [2.9801426627439453]
This study benchmarks the robustness of four state-of-the-art contrastive learning models: CLIP, CXR-RePaiR, MedCLIP, and CXR-CLIP.
Our findings reveal that all evaluated models are highly sensitive to out-of-distribution data.
By addressing these limitations, we can develop more reliable cross-domain retrieval models for medical applications.
arXiv Detail & Related papers (2025-01-15T20:37:04Z)
- An analysis of data variation and bias in image-based dermatological datasets for machine learning classification [2.039829968340841]
In clinical dermatology, classification models can detect malignant lesions on patients' skin using only RGB images as input.
Most learning-based methods are trained on data from dermoscopic datasets, which are large and validated by a gold standard.
This work aims to evaluate the gap between dermoscopic and clinical samples and understand how the dataset variations impact training.
arXiv Detail & Related papers (2025-01-15T17:18:46Z)
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- Multi-task Explainable Skin Lesion Classification [54.76511683427566]
We propose a few-shot-based approach for skin lesions that generalizes well with few labelled data.
The proposed approach fuses a segmentation network, which acts as an attention module, with a classification network.
arXiv Detail & Related papers (2023-10-11T05:49:47Z)
- Estimating label quality and errors in semantic segmentation data via any model [19.84626033109009]
We study methods to score label quality, such that the images with the lowest scores are least likely to be correctly labeled.
This helps prioritize what data to review in order to ensure a high-quality training/evaluation dataset.
arXiv Detail & Related papers (2023-07-11T07:29:09Z)
- Intrinsic Self-Supervision for Data Quality Audits [35.69673085324971]
Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors.
In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem or a scoring problem.
We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases.
arXiv Detail & Related papers (2023-05-26T15:57:04Z)
- Self-Supervised Learning as a Means To Reduce the Need for Labeled Data in Medical Image Analysis [64.4093648042484]
We use a dataset of chest X-ray images with bounding box labels for 13 different classes of anomalies.
We show that it is possible to achieve similar performance to a fully supervised model in terms of mean average precision and accuracy with only 60% of the labeled data.
arXiv Detail & Related papers (2022-06-01T09:20:30Z)
- Cascaded Robust Learning at Imperfect Labels for Chest X-ray Segmentation [61.09321488002978]
We present a novel cascaded robust learning framework for chest X-ray segmentation with imperfect annotation.
Our model consists of three independent networks, which can effectively learn useful information from the peer networks.
Our method achieves a significant improvement in segmentation accuracy compared to previous methods.
arXiv Detail & Related papers (2021-04-05T15:50:16Z)
- Cancer image classification based on DenseNet model [3.3516258832067067]
We propose a novel metastatic cancer image classification model based on DenseNet Block.
We evaluate the proposed approach on the slightly modified version of the PatchCamelyon (PCam) benchmark dataset.
arXiv Detail & Related papers (2020-11-23T03:05:42Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging consistent predictions for a given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
- An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification [2.985964157078619]
Automatic analysis of diseases in the GI tract is a hot topic in computer science and medical-related journals.
A clear understanding of evaluation metrics and of how machine learning models behave across datasets is crucial to bring research in the field to a new quality level.
We present comprehensive evaluations of five distinct machine learning models that can classify 16 different GI tract conditions.
arXiv Detail & Related papers (2020-05-08T08:59:31Z)