Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease
- URL: http://arxiv.org/abs/2512.00434v1
- Date: Sat, 29 Nov 2025 10:16:14 GMT
- Title: Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease
- Authors: Benjamin D. Ballyk, Ankit Gupta, Sujay Konda, Kavitha Subramanian, Chris Landon, Ahmed Ammar Naseer, Georg Maierhofer, Sumanth Swaminathan, Vasudevan Venkateshwaran,
- Abstract summary: We enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards.<n>Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets.<n>Our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier.
- Score: 1.334430331852034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guaranties. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.
Related papers
- Generative clinical time series models trained on moderate amounts of patient data are privacy preserving [1.7728232380247864]
We use a battery of privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset.<n>Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough datasets.
arXiv Detail & Related papers (2026-02-11T08:23:54Z) - CURENet: Combining Unified Representations for Efficient Chronic Disease Prediction [24.569877750738286]
We present CURENet, a multimodal model that integrates unstructured clinical notes, lab tests, and patients' time-series data.<n>CURENet has been capable of capturing the intricate interaction between different forms of clinical data and creating a more reliable predictive model for chronic illnesses.
arXiv Detail & Related papers (2025-11-14T15:52:22Z) - Controllable Synthetic Clinical Note Generation with Privacy Guarantees [7.1366477372157995]
In this paper, we introduce a novel method to "clone" datasets containing Personal Health Information (PHI)
Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy.
We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets.
arXiv Detail & Related papers (2024-09-12T07:38:34Z) - Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - Protect and Extend -- Using GANs for Synthetic Data Generation of
Time-Series Medical Records [1.9749268648715583]
This research compares state-of-the-art GAN-based models for synthetic data generation to generate time-series synthetic medical records of dementia patients.
Our experiments indicate the superiority of the privacy-preserving GAN (PPGAN) model over other models regarding privacy preservation.
arXiv Detail & Related papers (2024-02-21T10:24:34Z) - IGNITE: Individualized GeNeration of Imputations in Time-series Electronic health records [6.630372114304835]
We propose a novel deep-learning model that learns the underlying patient dynamics to generate personalized values conditioning on an individual's demographic characteristics and treatments.<n>Our proposed model, IGNITE, utilise a conditional dual-variational autoencoder augmented with dual-stage attention to generate missing values for an individual.<n>We show that IGNITE outperforms state-of-the-art approaches in missing data reconstruction and task prediction.
arXiv Detail & Related papers (2024-01-09T07:57:21Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic
Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z) - Safe AI for health and beyond -- Monitoring to transform a health
service [51.8524501805308]
We will assess the infrastructure required to monitor the outputs of a machine learning algorithm.
We will present two scenarios with examples of monitoring and updates of models.
arXiv Detail & Related papers (2023-03-02T17:27:45Z) - Private, fair and accurate: Training large-scale, privacy-preserving AI models in medical imaging [47.99192239793597]
We evaluated the effect of privacy-preserving training of AI models regarding accuracy and fairness compared to non-private training.
Our study shows that -- under the challenging realistic circumstances of a real-life clinical dataset -- the privacy-preserving training of diagnostic deep learning models is possible with excellent diagnostic accuracy and fairness.
arXiv Detail & Related papers (2023-02-03T09:49:13Z) - Generating Synthetic Mixed-type Longitudinal Electronic Health Records
for Artificial Intelligent Applications [9.374416143268892]
generative adversarial network (GAN) entitled EHR-M-GAN which synthesizes textitmixed-type timeseries EHR data.
We have validated EHR-M-GAN on three publicly-available intensive care unit databases with records from a total of 141,488 unique patients.
arXiv Detail & Related papers (2021-12-22T17:17:34Z) - Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z) - Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems.
In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.