Related papers: Controllable Synthetic Clinical Note Generation with Privacy Guarantees

Controllable Synthetic Clinical Note Generation with Privacy Guarantees

URL: http://arxiv.org/abs/2409.07809v1
Date: Thu, 12 Sep 2024 07:38:34 GMT
Title: Controllable Synthetic Clinical Note Generation with Privacy Guarantees
Authors: Tal Baumel, Andre Manoel, Daniel Jones, Shize Su, Huseyin Inan, Aaron, Bornstein, Robert Sim,
Abstract summary: In this paper, we introduce a novel method to "clone" datasets containing Personal Health Information (PHI) Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets.
Score: 7.1366477372157995
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the field of machine learning, domain-specific annotated data is an invaluable resource for training effective models. However, in the medical domain, this data often includes Personal Health Information (PHI), raising significant privacy concerns. The stringent regulations surrounding PHI limit the availability and sharing of medical datasets, which poses a substantial challenge for researchers and practitioners aiming to develop advanced machine learning models. In this paper, we introduce a novel method to "clone" datasets containing PHI. Our approach ensures that the cloned datasets retain the essential characteristics and utility of the original data without compromising patient privacy. By leveraging differential-privacy techniques and a novel fine-tuning task, our method produces datasets that are free from identifiable information while preserving the statistical properties necessary for model training. We conduct utility testing to evaluate the performance of machine learning models trained on the cloned datasets. The results demonstrate that our cloned datasets not only uphold privacy standards but also enhance model performance compared to those trained on traditional anonymized datasets. This work offers a viable solution for the ethical and effective utilization of sensitive medical data in machine learning, facilitating progress in medical research and the development of robust predictive models.

Related papers

Generative clinical time series models trained on moderate amounts of patient data are privacy preserving [1.7728232380247864]
We use a battery of privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset.<n>Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on large enough datasets.
arXiv Detail & Related papers (2026-02-11T08:23:54Z)
Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease [1.334430331852034]
We enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards.<n>Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets.<n>Our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier.
arXiv Detail & Related papers (2025-11-29T10:16:14Z)
A text-to-tabular approach to generate synthetic patient data using LLMs [0.3628457733531155]
We propose an approach to generate synthetic patient data that does not require access to the original data. We leverage prior medical knowledge and in-context learning capabilities of large language models to generate realistic patient data.
arXiv Detail & Related papers (2024-12-06T16:10:40Z)
Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice [0.3069335774032178]
It is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models.
arXiv Detail & Related papers (2024-11-19T12:19:28Z)
An Efficient Contrastive Unimodal Pretraining Method for EHR Time Series Data [35.943089444017666]
We propose an efficient method of contrastive pretraining tailored for long clinical timeseries data. Our model demonstrates the ability to impute missing measurements, providing clinicians with deeper insights into patient conditions.
arXiv Detail & Related papers (2024-10-11T19:05:25Z)
How Good Are Synthetic Medical Images? An Empirical Study with Lung Ultrasound [0.3312417881789094]
Adding synthetic training data using generative models offers a low-cost method to deal with the data scarcity challenge. We show that training with both synthetic and real data outperforms training with real data alone.
arXiv Detail & Related papers (2023-10-05T15:42:53Z)
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space. It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
Textual Data Augmentation for Patient Outcomes Prediction [67.72545656557858]
We propose a novel data augmentation method to generate artificial clinical notes in patients' Electronic Health Records. We fine-tune the generative language model GPT-2 to synthesize labeled text with the original training data. We evaluate our method on the most common patient outcome, i.e., the 30-day readmission rate.
arXiv Detail & Related papers (2022-11-13T01:07:23Z)
Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge. Existing private generative models are struggling with the utility of synthetic samples. We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
FLOP: Federated Learning on Medical Datasets using Partial Networks [84.54663831520853]
COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources. Different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19. The data itself is still scarce due to patient privacy concerns. We propose a simple yet effective algorithm, named textbfFederated textbfL textbfon Medical datasets using textbfPartial Networks (FLOP)
arXiv Detail & Related papers (2021-02-10T01:56:58Z)
Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using R'enyi differential privacy. Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data. We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)
Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging. We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets. We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios. Our results show that using 85% lesser labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)
Anonymizing Data for Privacy-Preserving Federated Learning [3.3673553810697827]
We propose the first syntactic approach for offering privacy in the context of federated learning. Our approach aims to maximize utility or model performance, while supporting a defensible level of privacy. We perform a comprehensive empirical evaluation on two important problems in the healthcare domain, using real-world electronic health data of 1 million patients.
arXiv Detail & Related papers (2020-02-21T02:30:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.