Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays
- URL: http://arxiv.org/abs/2507.07722v4
- Date: Fri, 18 Jul 2025 12:34:29 GMT
- Title: Understanding Dataset Bias in Medical Imaging: A Case Study on Chest X-rays
- Authors: Ethan Dack, Chengliang Dai,
- Abstract summary: We revisit the same task applied to popular open-source chest X-ray datasets. We apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have revisited the infamous task "Name That Dataset", demonstrating that non-medical datasets contain underlying biases and that the dataset origin task can be solved with high accuracy. In this work, we revisit the same task applied to popular open-source chest X-ray datasets. Medical images are naturally more difficult to release as open source due to their sensitive nature, which has led to certain open-source datasets being extremely popular for research purposes. By performing the same task, we wish to explore whether dataset bias also exists in these datasets. To extend our work, we apply simple transformations to the datasets, repeat the same task, and perform an analysis to identify and explain any detected biases. Given the importance of AI applications in medical imaging, it's vital to establish whether modern methods are taking shortcuts or are focused on the relevant pathology. We implement a range of different network architectures on the datasets: NIH, CheXpert, MIMIC-CXR and PadChest. We hope this work will encourage more explainable research in medical imaging and the creation of more open-source datasets in the medical domain. Our code can be found here: https://github.com/eedack01/x_ray_ds_bias.
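The dataset origin task described in the abstract can be illustrated with a toy sketch. This is not the authors' code or their network architectures: the images are synthetic, the per-dataset brightness offset is an assumed stand-in for an acquisition bias, and a nearest-centroid rule replaces a trained network. All function names are hypothetical. The sketch only shows the shape of the experiment: classify the origin, apply a simple transformation, and classify again.

```python
import random

# Dataset names come from the abstract; all pixel values below are synthetic.
DATASETS = ["NIH", "CheXpert", "MIMIC-CXR", "PadChest"]


def make_image(origin, rng, n_pixels=256):
    """Synthetic stand-in for an X-ray: noise plus an assumed per-dataset offset."""
    offset = 0.1 * origin  # hypothetical acquisition bias, not a measured value
    return [0.5 * rng.random() + offset for _ in range(n_pixels)]


def make_split(rng, per_dataset):
    """Balanced list of (image, origin_label) pairs, ordered by dataset index."""
    return [(make_image(k, rng), k)
            for k in range(len(DATASETS))
            for _ in range(per_dataset)]


def brightness(image):
    """Single feature: mean intensity, rounded so float noise cannot act as a cue."""
    return round(sum(image) / len(image), 6)


def identity(image):
    return image


def remove_mean(image):
    """Simple transformation: subtract the image mean, erasing the brightness cue."""
    m = sum(image) / len(image)
    return [p - m for p in image]


def fit_centroids(samples, transform):
    """Nearest-centroid model: average brightness per origin label."""
    grouped = {}
    for image, label in samples:
        grouped.setdefault(label, []).append(brightness(transform(image)))
    return {label: sum(v) / len(v) for label, v in grouped.items()}


def accuracy(centroids, samples, transform):
    """Fraction of samples whose nearest centroid matches the true origin."""
    hits = 0
    for image, label in samples:
        b = brightness(transform(image))
        guess = min(centroids, key=lambda k: abs(centroids[k] - b))
        hits += guess == label
    return hits / len(samples)


def run_experiment(seed=0, per_dataset=40):
    """Train and evaluate the origin classifier with and without the transformation."""
    rng = random.Random(seed)
    train, test = make_split(rng, per_dataset), make_split(rng, per_dataset)
    results = {}
    for name, transform in [("raw", identity), ("mean-removed", remove_mean)]:
        centroids = fit_centroids(train, transform)
        results[name] = accuracy(centroids, test, transform)
    return results


if __name__ == "__main__":
    for name, acc in run_experiment().items():
        print(f"{name}: origin accuracy {acc:.2f} (chance = {1 / len(DATASETS):.2f})")
```

In this toy setting the origin is easy to recover from raw images because the only cue is global brightness, and removing that cue pushes the classifier back toward chance. Real chest X-ray datasets may carry many other cues, so a drop like this should not be expected from a single transformation.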
Related papers
- In the Picture: Medical Imaging Datasets, Artifacts, and their Living Review [18.178774133733686]
We propose a living review that continuously tracks public datasets and their associated research artifacts across multiple medical imaging applications. We discuss key considerations for creating medical imaging datasets, review best practices for data annotation, discuss the significance of shortcuts and demographic diversity, and emphasize the importance of managing datasets throughout their entire lifecycle.
arXiv Detail & Related papers (2025-01-18T11:03:59Z)
- GazeSearch: Radiology Findings Search Benchmark [9.21918773048464]
Medical eye-tracking data is an important information source for understanding how radiologists visually interpret medical images. The current eye-tracking data is dispersed, unprocessed, and ambiguous, making it difficult to derive meaningful insights. In this work, we propose a refinement method inspired by the target-present visual search challenge: there is a specific finding and fixations are guided to locate it.
arXiv Detail & Related papers (2024-11-08T18:47:08Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Is in-domain data beneficial in transfer learning for landmarks detection in x-ray images? [1.5348047288817481]
We study whether the usage of small-scale in-domain x-ray image datasets may provide any improvement for landmark detection over models pre-trained on large natural image datasets only.
Our results show that using in-domain source datasets brings marginal or no benefit with respect to an ImageNet out-of-domain pre-training.
Our findings can provide an indication for the development of robust landmark detection systems in medical images when no large annotated dataset is available.
arXiv Detail & Related papers (2024-03-03T10:35:00Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [72.8965643836841]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Dataset Distillation for Medical Dataset Sharing [38.65823547986758]
Dataset distillation can synthesize a small dataset such that models trained on it achieve comparable performance with the original large dataset.
Experimental results on a COVID-19 chest X-ray image dataset show that our method can achieve high detection performance even using scarce anonymized chest X-ray images.
arXiv Detail & Related papers (2022-09-29T07:49:20Z)
- Computer-aided Tuberculosis Diagnosis with Attribute Reasoning Assistance [58.01014026139231]
We propose a new large-scale tuberculosis (TB) chest X-ray dataset (TBX-Att).
We establish an attribute-assisted weakly-supervised framework to classify and localize TB by leveraging the attribute information.
The proposed model is evaluated on the TBX-Att dataset and will serve as a solid baseline for future research.
arXiv Detail & Related papers (2022-07-01T07:50:35Z)
- Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation [55.00308939833555]
The PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD).
PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias.
PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias.
arXiv Detail & Related papers (2021-06-13T11:10:02Z)
- Chest x-ray automated triage: a semiologic approach designed for clinical implementation, exploiting different types of labels through a combination of four Deep Learning architectures [83.48996461770017]
This work presents a Deep Learning method based on the late fusion of different convolutional architectures.
We built four training datasets combining images from public chest x-ray datasets and our institutional archive.
We trained four different Deep Learning architectures and combined their outputs with a late fusion strategy, obtaining a unified tool.
arXiv Detail & Related papers (2020-12-23T14:38:35Z)
- Learning Invariant Feature Representation to Improve Generalization across Chest X-ray Datasets [55.06983249986729]
We show that a deep learning model that performs well when tested on the same dataset it was trained on starts to perform poorly when tested on a dataset from a different source.
By employing an adversarial training strategy, we show that a network can be forced to learn a source-invariant representation.
arXiv Detail & Related papers (2020-08-04T07:41:15Z)
- Deep Mining External Imperfect Data for Chest X-ray Disease Screening [57.40329813850719]
We argue that incorporating an external CXR dataset leads to imperfect training data, which raises challenges.
We formulate the multi-label disease classification problem as weighted independent binary tasks according to the categories.
Our framework simultaneously models and tackles the domain and label discrepancies, enabling superior knowledge mining ability.
arXiv Detail & Related papers (2020-06-06T06:48:40Z)
- IntrA: 3D Intracranial Aneurysm Dataset for Deep Learning [18.163031102785904]
We introduce an open-access 3D intracranial aneurysm dataset, IntrA, that makes the application of points-based and mesh-based classification and segmentation models available.
Our dataset can be used to diagnose intracranial aneurysms and to extract the neck for a clipping operation in medicine.
arXiv Detail & Related papers (2020-03-02T05:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.