On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets
- URL: http://arxiv.org/abs/2008.11572v1
- Date: Wed, 26 Aug 2020 14:16:01 GMT
- Title: On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets
- Authors: Beatriz Garcia Santa Cruz, Jan Sölter, Matias Nicolas Bossa and Andreas Dominik Husch
- Abstract summary: Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias.
This paper presents an overview of the currently publicly available COVID-19 chest X-ray datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning based methods for diagnosis and progression prediction of
COVID-19 from imaging data have gained significant attention in recent months, in
particular through the use of deep learning models. In this context, hundreds of models
were proposed, the majority of them trained on public datasets. Data scarcity, mismatch
between training and target population, group imbalance, and lack of documentation are
important sources of bias, hindering the applicability of these models to real-world
clinical practice. Considering that datasets are an essential part of model building and
evaluation, a deeper understanding of the current landscape is needed. This paper
presents an overview of the currently publicly available COVID-19 chest X-ray datasets.
Each dataset is briefly described, and potential strengths, limitations and interactions
between datasets are identified. In particular, some key properties of current datasets
that could act as sources of bias and impair models trained on them are pointed out.
These descriptions are useful for building models on those datasets, for choosing the
best dataset according to the model's goal, for taking the specific limitations into
account to avoid reporting overconfident benchmark results, and for discussing their
impact on generalisation capabilities in a specific clinical setting.
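The pitfalls the abstract names (group imbalance, training/target population mismatch) can be made visible with a simple audit before any training. Below is a minimal sketch, not from the paper; the record fields and source names are hypothetical. It tabulates the label distribution per pooled source, because a label drawn almost entirely from one source lets a model learn the source (scanner, site, demographics) instead of the disease.

```python
# Hypothetical audit: per-source label counts for a pooled X-ray dataset.
from collections import Counter

records = [
    {"source": "cohen_covid_collection", "label": "COVID-19"},
    {"source": "cohen_covid_collection", "label": "COVID-19"},
    {"source": "paediatric_pneumonia_set", "label": "normal"},
    {"source": "paediatric_pneumonia_set", "label": "pneumonia"},
]

counts = Counter((r["source"], r["label"]) for r in records)
for (source, label), n in sorted(counts.items()):
    print(f"{source:28s} {label:12s} {n}")

# If one label comes almost exclusively from one source, the source itself
# becomes a shortcut feature and benchmark results will be overconfident.
```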
Related papers
- Visual Data Diagnosis and Debiasing with Concept Graphs [50.84781894621378]
We present ConBias, a framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets.
We show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks.
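A minimal sketch of the diagnosis half of this idea only (the paper's clique-based balancing strategy is more involved); the concept annotations below are hypothetical:

```python
# Count how often concept pairs co-occur in a multi-label visual dataset;
# heavily skewed pairs hint at concept co-occurrence bias.
from collections import Counter
from itertools import combinations

images = [
    {"concepts": {"boat", "water"}},
    {"concepts": {"boat", "water"}},
    {"concepts": {"boat", "water"}},
    {"concepts": {"boat", "road"}},
]

pair_counts = Counter()
for img in images:
    pair_counts.update(combinations(sorted(img["concepts"]), 2))

total = sum(pair_counts.values())
for pair, n in pair_counts.most_common():
    print(pair, f"{n / total:.0%}")  # ('boat', 'water') dominates -> candidate bias
```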
arXiv Detail & Related papers (2024-09-26T16:59:01Z)
- Does Data-Efficient Generalization Exacerbate Bias in Foundation Models? [2.298227866545911]
Foundation models have emerged as robust models with label efficiency in diverse domains.
It is unclear whether using a large amount of unlabeled data, biased by the presence of sensitive attributes during pre-training, influences the fairness of the model.
This research examines bias in a foundation model when it is fine-tuned on the Brazilian Multilabel Ophthalmological dataset.
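One hedged way to probe such bias after fine-tuning is to compare a performance metric across groups defined by a sensitive attribute; the arrays below are placeholders, not the paper's data:

```python
# Group-wise AUC comparison as a simple fairness probe.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.2, 0.9, 0.7, 0.4, 0.6, 0.3, 0.55, 0.45])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # sensitive attribute

for g in np.unique(group):
    mask = group == g
    print(g, roc_auc_score(y_true[mask], y_score[mask]))
# A large AUC gap between groups is one signal of bias inherited from pre-training.
```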
arXiv Detail & Related papers (2024-08-28T22:14:44Z)
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
It suggests that collected datasets can effectively supplement existing datasets, especially when the data distribution changes.
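A minimal sketch of one statistical check this implies (not the paper's full framework): test whether a collected series shares distributional properties with the primary series before using it as a supplement. The data here are synthetic placeholders.

```python
# Two-sample Kolmogorov-Smirnov test between primary and collected series.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
primary = rng.normal(loc=0.0, scale=1.0, size=500)    # placeholder primary data
collected = rng.normal(loc=0.1, scale=1.1, size=500)  # placeholder collected data

stat, p_value = ks_2samp(primary, collected)
print(f"KS statistic={stat:.3f}, p={p_value:.3f}")
# A small KS statistic (large p) suggests the collected data may be a usable supplement.
```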
arXiv Detail & Related papers (2024-06-07T20:21:07Z)
- dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test [0.6827423171182154]
"dacl1k" is a multi-label RCD dataset for multi-label classification based on building inspections including 1,474 images.
We trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically.
The performance analysis on dacl1k shows the practical usability of the meta datasets, where the best model reaches an Exact Match Ratio of 32%.
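The Exact Match Ratio counts a sample as correct only when the entire predicted label set matches the ground truth, so 32% means every damage label was right on roughly a third of the images. A small illustration with placeholder label matrices:

```python
# Exact Match Ratio (EMR) for multi-label classification.
import numpy as np

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],
                   [0, 1, 1],   # one wrong label -> whole sample counts as wrong
                   [1, 1, 0]])

emr = np.all(y_true == y_pred, axis=1).mean()
print(f"Exact Match Ratio: {emr:.0%}")  # 2 of 3 samples fully correct -> 67%
```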
arXiv Detail & Related papers (2023-09-07T15:05:35Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, one is given access to a set of expert models and their predictions, alongside some limited information about the datasets used to train them.
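A heavily simplified sketch of the instance-wise intuition, not the paper's exact method: weight each expert by how plausible the test point is under a summary of that expert's training distribution, then average the predictions. The one-dimensional summaries and predictions are hypothetical.

```python
# Instance-wise weighting of expert predictions by training-data plausibility.
import numpy as np
from scipy.stats import norm

train_means = np.array([0.0, 5.0])   # per-expert training-data summaries (assumed)
train_scales = np.array([1.0, 1.0])
expert_preds = np.array([0.2, 0.9])  # each expert's prediction for x

x = 4.0
weights = norm.pdf(x, loc=train_means, scale=train_scales)
weights /= weights.sum()
print("weights:", weights, "combined:", weights @ expert_preds)
# The expert trained near x dominates; far-from-distribution experts are downweighted.
```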
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
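A simplified sketch in the same spirit, not Data-SUITE itself: flag in-distribution points whose features are poorly predicted from the others, relative to a conformal-style calibration quantile. All data here are synthetic.

```python
# Flag "incongruous" in-distribution points via calibrated feature residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=200)  # feature 2 depends on feature 0

train, calib, test = X[:100], X[100:150], X[150:]
reg = LinearRegression().fit(train[:, :2], train[:, 2])

calib_resid = np.abs(calib[:, 2] - reg.predict(calib[:, :2]))
threshold = np.quantile(calib_resid, 0.95)      # conformal-style cutoff

test_resid = np.abs(test[:, 2] - reg.predict(test[:, :2]))
print("incongruous indices:", np.where(test_resid > threshold)[0])
```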
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
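An illustrative use of both strategies via the imbalanced-learn library, with placeholder data; the paper's contribution is guidance on choosing among such strategies, not these calls themselves:

```python
# Random over- and under-sampling of an imbalanced binary dataset.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)  # 4:1 class imbalance

X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("original:", Counter(y), "over:", Counter(y_over), "under:", Counter(y_under))
```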
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- A Real Use Case of Semi-Supervised Learning for Mammogram Classification in a Local Clinic of Costa Rica [0.5541644538483946]
Training a deep learning model requires a considerable amount of labeled images.
A number of publicly available datasets have been built with data from different hospitals and clinics.
The use of the semi-supervised deep learning approach known as MixMatch to leverage unlabeled data is proposed and evaluated.
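One core step of MixMatch, sketched in isolation with placeholder predictions: guess a label for an unlabeled image by averaging predictions over augmented views, then sharpen the average with a temperature.

```python
# MixMatch-style label guessing and sharpening for an unlabeled example.
import numpy as np

def sharpen(p, T=0.5):
    """Lower the entropy of a class distribution p via temperature T."""
    p_t = p ** (1.0 / T)
    return p_t / p_t.sum()

preds_over_augmentations = np.array([[0.6, 0.4],
                                     [0.7, 0.3]])  # K=2 augmented views (placeholders)
guess = preds_over_augmentations.mean(axis=0)      # average: [0.65, 0.35]
print(sharpen(guess))                              # sharpened toward [0.78, 0.22]
```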
arXiv Detail & Related papers (2021-07-24T22:26:50Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
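A hedged, classical-test-theory approximation of what such an analysis computes (full IRT fits latent item and model parameters jointly); the response matrix below is a placeholder:

```python
# Approximate item difficulty and discrimination from a correctness matrix.
import numpy as np

# rows = models, columns = test items; 1 = model answered item correctly
R = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 0, 0, 0]])

difficulty = 1.0 - R.mean(axis=0)  # higher = harder item
ability = R.mean(axis=1)           # crude per-model skill score
discrimination = [float(np.corrcoef(R[:, j], ability)[0, 1]) for j in range(R.shape[1])]
print("difficulty:", difficulty)
print("discrimination:", np.round(discrimination, 2))
# High-discrimination items are the ones that separate strong from weak models.
```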
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
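The two training-dynamics statistics behind Data Maps are the mean and standard deviation, across epochs, of the probability the model assigns to each instance's gold label. A minimal sketch with a placeholder probability log:

```python
# Data Maps statistics: per-instance confidence and variability.
import numpy as np

# gold-label probabilities logged during training, shape (epochs, instances)
probs = np.array([[0.90, 0.20, 0.50],
                  [0.95, 0.30, 0.80],
                  [0.97, 0.25, 0.30]])

confidence = probs.mean(axis=0)   # high + stable -> "easy to learn"
variability = probs.std(axis=0)   # high variability -> "ambiguous"
print("confidence:", np.round(confidence, 2))
print("variability:", np.round(variability, 2))
# Low-confidence, low-variability instances are flagged as hard or possibly mislabeled.
```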
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.