A Multifaceted Benchmarking of Synthetic Electronic Health Record
Generation Models
- URL: http://arxiv.org/abs/2208.01230v1
- Date: Tue, 2 Aug 2022 03:44:45 GMT
- Title: A Multifaceted Benchmarking of Synthetic Electronic Health Record
Generation Models
- Authors: Chao Yan, Yao Yan, Zhiyu Wan, Ziqi Zhang, Larsson Omberg, Justin
Guinney, Sean D. Mooney, Bradley A. Malin
- Abstract summary: We introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data.
Results show that there is a utility-privacy tradeoff for sharing synthetic EHR data.
- Score: 15.165156674288623
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic health data have the potential to mitigate privacy concerns when
sharing data to support biomedical research and the development of innovative
healthcare applications. Modern approaches for data generation based on machine
learning, generative adversarial networks (GAN) methods in particular, continue
to evolve and demonstrate remarkable potential. Yet there is a lack of a
systematic assessment framework to benchmark methods as they emerge and
determine which methods are most appropriate for which use cases. In this work,
we introduce a generalizable benchmarking framework to appraise key
characteristics of synthetic health data with respect to utility and privacy
metrics. We apply the framework to evaluate synthetic data generation methods
for electronic health records (EHRs) data from two large academic medical
centers with respect to several use cases. The results illustrate that there is
a utility-privacy tradeoff for sharing synthetic EHR data. The results further
indicate that no method is unequivocally the best on all criteria in each use
case, which makes it evident why synthetic data generation methods need to be
assessed in context.
Related papers
- Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024.
We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z) - NFDI4Health workflow and service for synthetic data generation, assessment and risk management [0.0]
A promising solution to this challenge is synthetic data generation.
This technique creates entirely new datasets that mimic the statistical properties of real data.
In this paper, we present the workflow and different services developed in the context of Germany's National Data Infrastructure project NFDI4Health.
arXiv Detail & Related papers (2024-08-08T14:08:39Z) - Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models [69.06149482021071]
We propose a novel EHR data generation model called EHRPD.
It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation.
We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives.
arXiv Detail & Related papers (2024-06-20T02:20:23Z) - A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
arXiv Detail & Related papers (2024-04-20T08:08:28Z) - Towards Biologically Plausible and Private Gene Expression Data
Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Statistical properties and privacy guarantees of an original
distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Evaluation of the Synthetic Electronic Health Records [3.255030588361125]
This work outlines two metrics called Similarity and Uniqueness for sample-wise assessment of synthetic datasets.
We demonstrate the proposed notions with several state-of-the-art generative models to synthesise Cystic Fibrosis (CF) patients' electronic health records.
arXiv Detail & Related papers (2022-10-16T22:46:08Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Privacy-preserving medical image analysis [53.4844489668116]
We present PriMIA, a software framework designed for privacy-preserving machine learning (PPML) in medical imaging.
We show significantly better classification performance of a securely aggregated federated learning model compared to human experts on unseen datasets.
We empirically evaluate the framework's security against a gradient-based model inversion attack.
arXiv Detail & Related papers (2020-12-10T13:56:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.