MedShift: identifying shift data for medical dataset curation
- URL: http://arxiv.org/abs/2112.13885v1
- Date: Mon, 27 Dec 2021 20:06:23 GMT
- Title: MedShift: identifying shift data for medical dataset curation
- Authors: Xiaoyuan Guo, Judy Wawira Gichoya, Hari Trivedi, Saptarshi Purkayastha, and Imon Banerjee
- Abstract summary: Methods to detect shift or variance in data have not been significantly researched.
We propose a unified pipeline called MedShift to detect top-level shift samples.
We verify the efficacy of MedShift with musculoskeletal radiographs (MURA) and chest X-rays datasets from more than one external source.
- Score: 2.4236602474594635
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To curate a high-quality dataset, identifying data variance between the
internal and external sources is a fundamental and crucial step. However,
methods to detect shift or variance in data have not been significantly
researched. The main challenges are the lack of effective approaches for learning
dense representations of a dataset and the difficulty of sharing private data
across medical institutions. To overcome these problems, we propose a unified
pipeline called MedShift to detect the top-level shift samples and thus
facilitate medical data curation. Given an internal dataset A as the base
source, we first train anomaly detectors for each class of dataset A to learn
internal distributions in an unsupervised way. Second, without exchanging data
across sources, we run the trained anomaly detectors on an external dataset B
for each class. The data samples with high anomaly scores are identified as
shift data. To quantify the shiftness of the external dataset, we cluster B's
data into groups class-wise based on the obtained scores. We then train a
multi-class classifier on A and measure the shiftness with the classifier's
performance variance on B by gradually dropping the group with the largest
anomaly score for each class. Additionally, we adapt a dataset quality metric
to help inspect the distribution differences for multiple medical sources. We
verify the efficacy of MedShift with musculoskeletal radiographs (MURA) and
chest X-rays datasets from more than one external source. Experiments show our
proposed shift data detection pipeline can be beneficial for medical centers to
curate high-quality datasets more efficiently. An interface introduction video
to visualize our results is available at https://youtu.be/V3BF0P1sxQE.
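The pipeline described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the z-score "anomaly detector", the equal-size grouping by score, the nearest-class-mean classifier, the 1-D features, and every name below (`fit_detector`, `shiftness_curve`, `INTERNAL`, `EXTERNAL`) are assumptions chosen to keep the example self-contained. It only shows the flow: fit per-class detectors on A, score B with them, group B's samples by score, then track classifier accuracy on B as the highest-scoring groups are dropped.

```python
import statistics

def fit_detector(values):
    """Toy per-class detector (assumption): mean and std of a 1-D feature."""
    return statistics.fmean(values), statistics.pstdev(values)

def anomaly_score(detector, x):
    """Absolute z-score of x under the fitted detector."""
    mean, std = detector
    return abs(x - mean) / (std or 1.0)  # guard against zero std

def group_by_score(scores, n_groups):
    """Split sample indices into groups ordered by ascending anomaly score."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    size = -(-len(order) // n_groups)  # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]

def classify(detectors, x):
    """Stand-in classifier trained on A: pick the nearest class mean."""
    return min(detectors, key=lambda c: abs(x - detectors[c][0]))

def shiftness_curve(internal, external, n_groups=4):
    """Accuracy on B after dropping 0, 1, 2, ... highest-score groups per class.

    Only the detectors fitted on A are applied to B; no raw data from A
    is needed at scoring time.
    """
    detectors = {c: fit_detector(v) for c, v in internal.items()}
    groups = {}
    for c, values in external.items():
        scores = [anomaly_score(detectors[c], x) for x in values]
        groups[c] = group_by_score(scores, n_groups)

    curve = []
    for dropped in range(n_groups):
        correct = total = 0
        for c, values in external.items():
            keep = max(len(groups[c]) - dropped, 0)
            for group in groups[c][:keep]:
                for i in group:
                    correct += classify(detectors, values[i]) == c
                    total += 1
        if total == 0:
            break
        curve.append(correct / total)
    return curve

# Hypothetical data: dataset A (internal) and dataset B (external).
# Each class in B contains two samples shifted toward the other class.
INTERNAL = {0: [-1.0, -0.5, 0.0, 0.5, 1.0], 1: [3.0, 3.5, 4.0, 4.5, 5.0]}
EXTERNAL = {
    0: [-0.8, 0.2, 0.6, -0.3, 2.8, 3.1, 0.1, 0.9],
    1: [4.2, 3.7, 4.9, 1.2, 0.9, 3.4, 4.4, 5.1],
}

if __name__ == "__main__":
    print(shiftness_curve(INTERNAL, EXTERNAL))  # [0.75, 1.0, 1.0, 1.0]
```

In this toy setup, accuracy on B rises from 0.75 to 1.0 once the highest-scoring group of each class is dropped, which is the kind of performance change the abstract uses to quantify shiftness.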
Related papers
- Adversarial Learning for Feature Shift Detection and Correction [45.65548560695731]
Feature shifts can occur in many datasets, including in multi-sensor data, where some sensors are malfunctioning, or in structured data, where faulty standardization and data processing pipelines can lead to erroneous features.
In this work, we explore using the principles of adversarial learning, where the information from several discriminators trained to distinguish between two distributions is used to both detect the corrupted features and fix them in order to remove the distribution shift between datasets.
arXiv Detail & Related papers (2023-12-07T18:58:40Z)
- Dynamic Multimodal Information Bottleneck for Multimodality Classification [26.65073424377933]
We propose a dynamic multimodal information bottleneck framework for attaining a robust fused feature representation.
Specifically, our information bottleneck module serves to filter out the task-irrelevant information and noises in the fused feature.
Our method surpasses the state-of-the-art and is significantly more robust, being the only method to maintain performance when large-scale noisy channels exist.
arXiv Detail & Related papers (2023-11-02T08:34:08Z)
- Binary Quantification and Dataset Shift: An Experimental Investigation [54.14283123210872]
Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data.
The relationship between quantification and other types of dataset shift remains, by and large, unexplored.
We propose a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift.
arXiv Detail & Related papers (2023-10-06T20:11:27Z)
- ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- ALGAN: Anomaly Detection by Generating Pseudo Anomalous Data via Latent Variables [17.53032543377636]
We propose an Anomalous Latent variable Generative Adversarial Network (ALGAN) in which the GAN generator produces pseudo-anomalous data as well as fake-normal data.
The proposed ALGAN exhibited an AUROC comparable to state-of-the-art methods while achieving a much faster prediction time.
arXiv Detail & Related papers (2022-02-21T14:53:05Z)
- Embracing the Disharmony in Heterogeneous Medical Data [12.739380441313022]
Heterogeneity in medical imaging data is often tackled, in the context of machine learning, using domain invariance.
This paper instead embraces the heterogeneity and treats it as a multi-task learning problem.
We show that this approach improves classification accuracy by 5-30% across different datasets on the main classification tasks.
arXiv Detail & Related papers (2021-03-23T21:36:39Z)
- My Health Sensor, my Classifier: Adapting a Trained Classifier to Unlabeled End-User Data [0.5091527753265949]
In this work, we present an approach for unsupervised domain adaptation (DA) with the constraint that the labeled source data are not directly available.
Our solution iteratively labels only high-confidence sub-regions of the target data distribution, based on the belief of the classifier.
The goal is to apply the proposed approach on DA for the task of sleep apnea detection and achieve personalization based on the needs of the patient.
arXiv Detail & Related papers (2020-09-22T20:27:35Z)
- Learning Invariant Feature Representation to Improve Generalization across Chest X-ray Datasets [55.06983249986729]
We show that a deep learning model performing well when tested on the same dataset as training data starts to perform poorly when it is tested on a dataset from a different source.
By employing an adversarial training strategy, we show that a network can be forced to learn a source-invariant representation.
arXiv Detail & Related papers (2020-08-04T07:41:15Z)
- ATSO: Asynchronous Teacher-Student Optimization for Semi-Supervised Medical Image Segmentation [99.90263375737362]
We propose ATSO, an asynchronous version of teacher-student optimization.
ATSO partitions the unlabeled data into two subsets and alternately uses one subset to fine-tune the model and updates the label on the other subset.
We evaluate ATSO on two popular medical image segmentation datasets and show its superior performance in various semi-supervised settings.
arXiv Detail & Related papers (2020-06-24T04:05:12Z)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z)