Methods for generating and evaluating synthetic longitudinal patient
data: a systematic review
- URL: http://arxiv.org/abs/2309.12380v2
- Date: Wed, 6 Mar 2024 09:22:40 GMT
- Title: Methods for generating and evaluating synthetic longitudinal patient
data: a systematic review
- Authors: Katariina Perkonoja and Kari Auranen and Joni Virta
- Abstract summary: This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data.
The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022.
The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of data in recent years has led to the advancement and
utilization of various statistical and deep learning techniques, thus
expediting research and development activities. However, not all industries
have benefited equally from the surge in data availability, partly due to legal
restrictions on data usage and privacy regulations, such as in medicine. To
address this issue, various statistical disclosure and privacy-preserving
methods have been proposed, including the use of synthetic data generation.
Synthetic data are generated based on some existing data, with the aim of
replicating them as closely as possible and acting as a proxy for real
sensitive data. This paper presents a systematic review of methods for
generating and evaluating synthetic longitudinal patient data, a prevalent data
type in medicine. The review adheres to the PRISMA guidelines and covers
literature from five databases until the end of 2022. The paper describes 17
methods, ranging from traditional simulation techniques to modern deep learning
methods. The collected information includes, but is not limited to, method
type, source code availability, and approaches used to assess resemblance,
utility, and privacy. Furthermore, the paper discusses practical guidelines and
key considerations for developing synthetic longitudinal data generation
methods.
Related papers
- Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice [0.3069335774032178]
It is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies.
This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models.
arXiv Detail & Related papers (2024-11-19T12:19:28Z) - Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys only focus on a certain type of specific modality data.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Towards Biologically Plausible and Private Gene Expression Data
Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Statistical properties and privacy guarantees of an original
distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z) - SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - A Multifaceted Benchmarking of Synthetic Electronic Health Record
Generation Models [15.165156674288623]
We introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data.
Results show that there is a utility-privacy tradeoff for sharing synthetic EHR data.
arXiv Detail & Related papers (2022-08-02T03:44:45Z) - DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.