A Semi-Supervised Algorithm for Improving the Consistency of
Crowdsourced Datasets: The COVID-19 Case Study on Respiratory Disorder
Classification
- URL: http://arxiv.org/abs/2209.04360v1
- Date: Fri, 9 Sep 2022 15:44:26 GMT
- Title: A Semi-Supervised Algorithm for Improving the Consistency of
Crowdsourced Datasets: The COVID-19 Case Study on Respiratory Disorder
Classification
- Authors: Lara Orlandic, Tomas Teijeiro, David Atienza
- Abstract summary: Cough audio signal classification is a potentially useful tool in screening for respiratory disorders, such as COVID-19.
Many research teams have turned to crowdsourcing to quickly gather cough sound data, as was done to generate the COUGHVID dataset.
The COUGHVID dataset enlisted expert physicians to diagnose the underlying diseases present in a limited number of uploaded recordings.
This work uses a semi-supervised learning (SSL) approach to improve the labeling consistency of the COUGHVID dataset and the robustness of COVID-19 versus healthy cough sound classification.
- Score: 4.431270735024064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cough audio signal classification is a potentially useful tool in screening
for respiratory disorders, such as COVID-19. Since it is dangerous to collect
data from patients with such contagious diseases, many research teams have
turned to crowdsourcing to quickly gather cough sound data, as was done to
generate the COUGHVID dataset. The COUGHVID dataset enlisted expert physicians
to diagnose the underlying diseases present in a limited number of uploaded
recordings. However, this approach suffers from potential mislabeling of the
coughs, as well as notable disagreement between experts. In this work, we use a
semi-supervised learning (SSL) approach to improve the labeling consistency of
the COUGHVID dataset and the robustness of COVID-19 versus healthy cough sound
classification. First, we leverage existing SSL expert knowledge aggregation
techniques to overcome the labeling inconsistencies and sparsity in the
dataset. Next, our SSL approach is used to identify a subsample of re-labeled
COUGHVID audio samples that can be used to train or augment future cough
classification models. The consistency of the re-labeled data is demonstrated
in that it exhibits a high degree of class separability, 3x higher than that of
the user-labeled data, despite the expert label inconsistency present in the
original dataset. Furthermore, the spectral differences in the user-labeled
audio segments are amplified in the re-labeled data, resulting in significantly
different power spectral densities between healthy and COVID-19 coughs, which
demonstrates both the increased consistency of the new dataset and its
explainability from an acoustic perspective. Finally, we demonstrate how the
re-labeled dataset can be used to train a cough classifier. This SSL approach
can be used to combine the medical knowledge of several experts to improve the
database consistency for any diagnostic classification task.
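The abstract's comparison of power spectral densities between two groups of recordings can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a Welch-style averaged periodogram written in plain NumPy, and it uses synthetic tones plus noise in place of real cough audio. The names `welch_psd`, `mean_group_psd`, `group_a`, and `group_b`, the 8 kHz sampling rate, and the segment length are all hypothetical choices made for illustration.

```python
import numpy as np


def welch_psd(signal, fs, nperseg=256):
    """One-sided PSD from averaged Hann-windowed periodograms (50% overlap).

    Assumes nperseg is even, so the last rfft bin is the Nyquist bin.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < nperseg:
        raise ValueError("signal is shorter than nperseg")
    step = nperseg // 2
    window = np.hanning(nperseg)
    scale = fs * np.sum(window ** 2)  # density normalization
    periodograms = []
    for start in range(0, len(signal) - nperseg + 1, step):
        segment = signal[start:start + nperseg]
        segment = segment - segment.mean()  # remove the per-segment DC offset
        spectrum = np.fft.rfft(segment * window)
        periodograms.append(np.abs(spectrum) ** 2 / scale)
    psd = np.mean(periodograms, axis=0)
    psd[1:-1] *= 2  # fold negative frequencies into the one-sided estimate
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    return freqs, psd


def mean_group_psd(recordings, fs, nperseg=256):
    """Average the PSD estimates of several recordings in one group."""
    freqs = np.fft.rfftfreq(nperseg, d=1.0 / fs)
    psds = [welch_psd(rec, fs, nperseg)[1] for rec in recordings]
    return freqs, np.mean(psds, axis=0)


# Synthetic stand-ins for two label groups (NOT real cough recordings).
fs = 8000
t = np.arange(fs) / fs  # one second of samples
rng = np.random.default_rng(0)
group_a = [np.sin(2 * np.pi * 500 * t) + 0.5 * rng.standard_normal(fs) for _ in range(5)]
group_b = [np.sin(2 * np.pi * 1500 * t) + 0.5 * rng.standard_normal(fs) for _ in range(5)]

freqs, psd_a = mean_group_psd(group_a, fs)
_, psd_b = mean_group_psd(group_b, fs)

# Dominant frequency of each group's averaged spectrum.
peak_a = freqs[np.argmax(psd_a)]
peak_b = freqs[np.argmax(psd_b)]
print(f"group A peak: {peak_a:.1f} Hz, group B peak: {peak_b:.1f} Hz")
```

With real data, the two group spectra would typically be compared band by band with a statistical test rather than by peak frequency alone; the peak comparison here only makes the synthetic example easy to check.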
Related papers
- Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers [14.144599890583308]
We propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set.
Experimental results demonstrate that our proposed approach consistently outperforms prior art on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%.
arXiv Detail & Related papers (2024-08-28T09:40:40Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites:
A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- Synthetic Augmentation with Large-scale Unconditional Pre-training [4.162192894410251]
We propose a synthetic augmentation method called HistoDiffusion to reduce the dependency on annotated data.
HistoDiffusion can be pre-trained on large-scale unlabeled datasets and later applied to a small-scale labeled dataset for augmented training.
We evaluate our proposed method by pre-training on three histopathology datasets and testing on a histopathology dataset of colorectal cancer (CRC) excluded from the pre-training datasets.
arXiv Detail & Related papers (2023-08-08T03:34:04Z)
- Semi-Supervised Learning for Multi-Label Cardiovascular Diseases
Prediction: A Multi-Dataset Study [17.84069222975825]
Current ECG-based diagnosis systems show promising performance owing to the rapid development of deep learning techniques.
However, label scarcity, the co-occurrence of multiple CVDs, and poor performance on unseen datasets hinder the widespread application of deep learning-based models.
We propose a multi-label semi-supervised model (ECGMatch) to recognize multiple CVDs simultaneously with limited supervision.
arXiv Detail & Related papers (2023-06-18T07:46:19Z)
- Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on
Respiratory Sound Classification [19.180927437627282]
We introduce a novel and effective Patch-Mix Contrastive Learning to distinguish the mixed representations in the latent space.
Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by 4.08%.
arXiv Detail & Related papers (2023-05-23T13:04:07Z)
- Improving Medical Image Classification with Label Noise Using
Dual-uncertainty Estimation [72.0276067144762]
We discuss and define the two common types of label noise in medical images.
We propose an uncertainty estimation-based framework to handle these two types of label noise in medical image classification.
arXiv Detail & Related papers (2021-02-28T14:56:45Z)
- Correcting Data Imbalance for Semi-Supervised Covid-19 Detection Using
X-ray Chest Images [4.1950566803514935]
We evaluate the performance of the semi-supervised deep learning architecture known as MixMatch.
A new dataset is included among the tested datasets, composed of chest X-ray images of Costa Rican adult patients.
arXiv Detail & Related papers (2020-08-19T15:16:57Z)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z)
- Semi-supervised Medical Image Classification with Relation-driven
Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
- Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making aimed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)
- Self-Training with Improved Regularization for Sample-Efficient Chest
X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)