An Investigation of Memorization Risk in Healthcare Foundation Models
- URL: http://arxiv.org/abs/2510.12950v1
- Date: Tue, 14 Oct 2025 19:55:07 GMT
- Title: An Investigation of Memorization Risk in Healthcare Foundation Models
- Authors: Sana Tonekaboni, Lena Stempfle, Adibvafa Fallahpour, Walter Gerych, Marzyeh Ghassemi
- Abstract summary: We introduce a suite of black-box evaluation tests to assess privacy-related risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings.
- Score: 21.94560578418064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.
Related papers
- Integrating Genomics into Multimodal EHR Foundation Models [56.31910745104141]
This paper introduces an innovative EHR foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality. The framework aims to learn complex relationships between clinical data and genetic predispositions. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies.
arXiv Detail & Related papers (2025-10-24T15:56:40Z)
- Exploring Membership Inference Vulnerabilities in Clinical Large Language Models [42.52690697965999]
We present an exploratory empirical study on membership inference vulnerabilities in clinical large language models (LLMs). Using a state-of-the-art clinical question-answering model, Llemr, we evaluate both canonical loss-based attacks and a domain-motivated paraphrasing-based perturbation strategy. The results motivate continued development of context-aware, domain-specific privacy evaluations and defenses.
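The canonical loss-based attack mentioned above can be illustrated with a minimal sketch. Everything here is hypothetical (the token probabilities, the threshold, and the function names are illustrative, not from the paper): a record whose average token loss under the model is unusually low is flagged as a suspected training member.

```python
import math

def sequence_loss(token_probs):
    """Average negative log-likelihood of a token sequence.

    `token_probs` are the model's probabilities for each observed token;
    here they stand in for real model outputs (hypothetical values).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def loss_attack(candidate_probs, threshold):
    """Canonical loss-based membership inference: flag a record as a
    suspected training member when its loss falls below a threshold
    calibrated on known non-members."""
    return sequence_loss(candidate_probs) < threshold

# A memorized record tends to receive higher token probabilities (lower loss)
member_probs = [0.9, 0.85, 0.95]   # hypothetical: record seen in training
nonmember_probs = [0.3, 0.4, 0.2]  # hypothetical: unseen record
```

In practice the threshold is calibrated on held-out non-member records, and attack quality is reported as an ROC curve over many candidates rather than a single cutoff.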
arXiv Detail & Related papers (2025-10-21T14:27:48Z)
- An Analytical Approach to Privacy and Performance Trade-Offs in Healthcare Data Sharing [1.2179548969182572]
Older adults, frequently hospitalized patients, and racial minorities are especially vulnerable to privacy attacks. We evaluate three anonymization methods ($k$-anonymity, the technique by Zheng et al., and the MO-OBAM model) based on their ability to reduce re-identification risk.
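As a point of reference for the first of those methods, a $k$-anonymity check can be sketched in a few lines; the records, column names, and helper below are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """A dataset is k-anonymous when every combination of quasi-identifier
    values (e.g. age bracket, ZIP prefix) is shared by at least k records,
    so no individual stands out on those columns alone."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical generalized EHR rows: ages bucketed, ZIPs truncated
rows = [
    {"age": "60-70", "zip": "021**"},
    {"age": "60-70", "zip": "021**"},
    {"age": "80-90", "zip": "021**"},  # a group of size 1 breaks 2-anonymity
]
```

The third row illustrates why vulnerable subgroups matter: rare quasi-identifier combinations (e.g. very old or frequently hospitalized patients) form small groups and are the first to fail the check.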
arXiv Detail & Related papers (2025-08-25T21:36:47Z)
- Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings [48.096652370210016]
We introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives. This is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming from three different points of view.
arXiv Detail & Related papers (2025-07-09T19:38:58Z)
- Differential Privacy-Driven Framework for Enhancing Heart Disease Prediction [7.473832609768354]
Machine learning is critical in healthcare, supporting personalized treatment, early disease detection, predictive analytics, image interpretation, drug discovery, efficient operations, and patient monitoring. In this paper, we utilize machine learning methodologies, including differential privacy and federated learning, to develop privacy-preserving models. Our results show that using a federated learning model with differential privacy achieved a test accuracy of 85%, ensuring patient data remained secure and private throughout the process.
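The combination of federated learning and differential privacy described above can be sketched as one aggregation round: each hospital's model update is clipped to bound any single patient's influence, then calibrated Gaussian noise is added before averaging. The clipping norm, noise scale, and plain-list weight representation are illustrative assumptions, not the paper's actual training setup.

```python
import random

def clip(update, max_norm):
    """Clip a client's model update to bound per-client (and hence
    per-patient) influence on the aggregate."""
    norm = sum(u * u for u in update) ** 0.5
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def dp_federated_average(client_updates, max_norm, noise_std, rng):
    """One round of federated averaging with Gaussian-mechanism noise:
    clip each client's update, sum, add noise scaled to the clipping
    norm (the sensitivity), then average."""
    n = len(client_updates)
    dim = len(client_updates[0])
    total = [0.0] * dim
    for update in client_updates:
        for i, u in enumerate(clip(update, max_norm)):
            total[i] += u
    return [(t + rng.gauss(0.0, noise_std * max_norm)) / n for t in total]
```

Raw patient records never leave the client; only the noised average of clipped updates is shared, which is what makes the per-round privacy guarantee possible.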
arXiv Detail & Related papers (2025-04-25T01:27:40Z)
- Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities [58.61680631581921]
Mental health disorders create profound personal and societal burdens, yet conventional diagnostics are resource-intensive and limit accessibility. This paper examines these challenges and proposes solutions, including anonymization, synthetic data, and privacy-preserving training. It aims to advance reliable, privacy-aware AI tools that support clinical decision-making and improve mental health outcomes.
arXiv Detail & Related papers (2025-02-01T15:10:02Z)
- FedDP: Privacy-preserving method based on federated learning for histopathology image segmentation [2.864354559973703]
This paper addresses the dispersed nature and privacy sensitivity of medical image data by employing a federated learning framework.
The proposed method, FedDP, minimally impacts model accuracy while effectively safeguarding the privacy of cancer pathology image data.
arXiv Detail & Related papers (2024-11-07T08:02:58Z)
- FEDMEKI: A Benchmark for Scaling Medical Foundation Models via Federated Knowledge Injection [83.54960238236548]
FEDMEKI not only preserves data privacy but also enhances the capability of medical foundation models.
FEDMEKI allows medical foundation models to learn from a broader spectrum of medical knowledge without direct data exposure.
arXiv Detail & Related papers (2024-08-17T15:18:56Z)
- Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area.
We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions.
We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
- Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
arXiv Detail & Related papers (2023-02-03T09:49:13Z)
- Differentially private federated deep learning for multi-site medical image segmentation [56.30543374146002]
Collaborative machine learning techniques such as federated learning (FL) enable the training of models on effectively larger datasets without data transfer.
Recent initiatives have demonstrated that segmentation models trained with FL can achieve performance similar to locally trained models.
However, FL is not a fully privacy-preserving technique and privacy-centred attacks can disclose confidential patient data.
arXiv Detail & Related papers (2021-07-06T12:57:32Z)
- Defending Medical Image Diagnostics against Privacy Attacks using Generative Methods [10.504951891644474]
We develop and evaluate a privacy defense protocol based on a generative adversarial network (GAN).
We validate the proposed method on a retinal diagnostics AI system for diabetic retinopathy that carries the risk of leaking private information.
arXiv Detail & Related papers (2021-03-04T15:02:57Z)
- Clinical Outcome Prediction from Admission Notes using Self-Supervised Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z)
- Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
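To give intuition for the gradient-based model inversion attack evaluated above, here is a minimal sketch for a single linear neuron with squared-error loss, where the shared gradient is a scaled copy of the raw input. The model, the loss, and the recoverability of the scalar error term are simplifying assumptions for illustration, not PriMIA's actual threat model.

```python
def linear_gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w.x - y)^2 with respect to
    the weights w: it equals (w.x - y) * x, i.e. the raw input scaled
    by the scalar prediction error."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x], err

def invert_from_gradient(grad, err):
    """Given the shared gradient and the scalar error, the raw input is
    recovered exactly; this leakage is the essence of gradient-based
    model inversion against naively shared updates."""
    return [g / err for g in grad]
```

For deep networks the inversion is not this direct, but the same principle (gradients carry input information) drives optimization-based reconstruction attacks, which is why frameworks like PriMIA combine secure aggregation with other defenses.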
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
- Anonymizing Data for Privacy-Preserving Federated Learning [3.3673553810697827]
We propose the first syntactic approach for offering privacy in the context of federated learning.
Our approach aims to maximize utility or model performance, while supporting a defensible level of privacy.
We perform a comprehensive empirical evaluation on two important problems in the healthcare domain, using real-world electronic health data of 1 million patients.
arXiv Detail & Related papers (2020-02-21T02:30:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.