MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
- URL: http://arxiv.org/abs/2511.02400v1
- Date: Tue, 04 Nov 2025 09:29:46 GMT
- Title: MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
- Authors: Yalda Zafari, Hongyi Pan, Gorkem Durak, Ulas Bagci, Essam A. Rashed, Mohamed Mabrok
- Abstract summary: MammoClean is a framework for standardization and bias quantification in mammography datasets. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography.
- Score: 11.587767922555628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of clinically reliable artificial intelligence (AI) systems for mammography is hindered by profound heterogeneity in data quality, metadata standards, and population distributions across public datasets. This heterogeneity introduces dataset-specific biases that severely compromise the generalizability of the model, a fundamental barrier to clinical deployment. We present MammoClean, a public framework for standardization and bias quantification in mammography datasets. MammoClean standardizes case selection, image processing (including laterality and intensity correction), and unifies metadata into a consistent multi-view structure. We provide a comprehensive review of breast anatomy, imaging characteristics, and public mammography datasets to systematically identify key sources of bias. Applying MammoClean to three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify substantial distributional shifts in breast density and abnormality prevalence. Critically, we demonstrate the direct impact of data corruption: AI models trained on corrupted datasets exhibit significant performance degradation compared to their curated counterparts. By using MammoClean to identify and mitigate bias sources, researchers can construct unified multi-dataset training corpora that enable development of robust models with superior cross-domain generalization. MammoClean provides an essential, reproducible pipeline for bias-aware AI development in mammography, facilitating fairer comparisons and advancing the creation of safe, effective systems that perform equitably across diverse patient populations and clinical settings. The open-source code is publicly available from: https://github.com/Minds-R-Lab/MammoClean.
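The abstract reports distributional shifts in breast density across datasets. As a rough illustration of how such a shift could be quantified, the sketch below compares two categorical label distributions using the total variation distance. This is not MammoClean's code or API: the function names, the A to D density coding (following the BI-RADS convention), and the toy label lists are all assumptions made for the example, and the metric is a generic choice rather than the one used in the paper.

```python
from collections import Counter


def category_distribution(labels, categories):
    """Return the relative frequency of each category, in a fixed order."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall into the given categories")
    return [counts[c] / total for c in categories]


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions.

    0 means the distributions are identical, 1 means they do not overlap.
    """
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical toy labels for illustration only; NOT taken from any real dataset.
DENSITY_CATEGORIES = ["A", "B", "C", "D"]
dataset_x = ["A"] * 10 + ["B"] * 40 + ["C"] * 40 + ["D"] * 10
dataset_y = ["A"] * 5 + ["B"] * 15 + ["C"] * 60 + ["D"] * 20

p = category_distribution(dataset_x, DENSITY_CATEGORIES)
q = category_distribution(dataset_y, DENSITY_CATEGORIES)
shift = total_variation_distance(p, q)
print(f"Total variation distance between density distributions: {shift:.2f}")
```

A value near 0 would suggest similar density distributions, while larger values point to a shift worth accounting for before pooling datasets. In practice the labels would first need to be mapped to a shared coding, since datasets may record density differently.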
Related papers
- Who Does Your Algorithm Fail? Investigating Age and Ethnic Bias in the MAMA-MIA Dataset [8.774604259603304]
We audit the fairness of the automated segmentation labels provided in the breast cancer tumor segmentation dataset MAMA-MIA. Our analysis reveals an intrinsic age-related bias against younger patients that persists even after controlling for confounding factors, such as data source.
arXiv Detail & Related papers (2025-10-31T12:20:31Z)
- Bias and Generalizability of Foundation Models across Datasets in Breast Mammography [4.117899774444893]
We explore the fairness and bias of foundation models (FMs) for breast mammography classification. We leverage a large pool of datasets from diverse sources, including data from underrepresented regions and an in-house dataset. Our experiments show that while modality-specific pre-training of FMs enhances performance, classifiers trained on features from individual datasets fail to generalize across domains.
arXiv Detail & Related papers (2025-05-14T06:56:17Z)
- AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis [16.21270312974956]
We introduce a novel statistical framework to evaluate the dependency of medical imaging ML models on sensitive attributes, such as demographics. We present a practical algorithm that combines conditional latent diffusion models with statistical hypothesis testing to identify and quantify such biases.
arXiv Detail & Related papers (2025-04-28T09:28:25Z)
- MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions [0.13108652488669734]
Integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness.
We create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities.
arXiv Detail & Related papers (2024-06-25T13:20:39Z)
- A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds [49.34500499203579]
We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics.
We generate high-quality synthetic fMRI data based on user-supplied demographics.
arXiv Detail & Related papers (2024-05-13T17:49:20Z)
- Fairness-Aware Data Augmentation for Cardiac MRI using Text-Conditioned Diffusion Models [1.6581402323174208]
We propose a method to alleviate imbalances inherent in datasets through the generation of synthetic data. We adopt ControlNet based on a denoising diffusion probabilistic model to condition on text assembled from patient metadata and cardiac geometry. Our experiments demonstrate the effectiveness of the proposed approach in mitigating dataset imbalances.
arXiv Detail & Related papers (2024-03-28T15:41:43Z)
- ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment.
The scarcity of annotated data limits the effectiveness and generalization of existing methods.
We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
- Source-Free Collaborative Domain Adaptation via Multi-Perspective Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
However, acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z)
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Modeling Shared Responses in Neuroimaging Studies through MultiView ICA [94.31804763196116]
Group studies involving large cohorts of subjects are important to draw general conclusions about brain functional organization.
We propose a novel MultiView Independent Component Analysis model for group studies, where data from each subject are modeled as a linear combination of shared independent sources plus noise.
We demonstrate the usefulness of our approach first on fMRI data, where our model demonstrates improved sensitivity in identifying common sources among subjects.
arXiv Detail & Related papers (2020-06-11T17:29:53Z)