Related papers: Methods for generating and evaluating synthetic longitudinal patient data: a systematic review

URL: http://arxiv.org/abs/2309.12380v2
Date: Wed, 6 Mar 2024 09:22:40 GMT
Title: Methods for generating and evaluating synthetic longitudinal patient data: a systematic review
Authors: Katariina Perkonoja and Kari Auranen and Joni Virta
Abstract summary: This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of data in recent years has led to the advancement and utilization of various statistical and deep learning techniques, thus expediting research and development activities. However, not all industries have benefited equally from the surge in data availability, partly due to legal restrictions on data usage and privacy regulations, such as in medicine. To address this issue, various statistical disclosure and privacy-preserving methods have been proposed, including the use of synthetic data generation. Synthetic data are generated based on some existing data, with the aim of replicating them as closely as possible and acting as a proxy for real sensitive data. This paper presents a systematic review of methods for generating and evaluating synthetic longitudinal patient data, a prevalent data type in medicine. The review adheres to the PRISMA guidelines and covers literature from five databases until the end of 2022. The paper describes 17 methods, ranging from traditional simulation techniques to modern deep learning methods. The collected information includes, but is not limited to, method type, source code availability, and approaches used to assess resemblance, utility, and privacy. Furthermore, the paper discusses practical guidelines and key considerations for developing synthetic longitudinal data generation methods.

Related papers

A Survey on Tabular Data Generation: Utility, Alignment, Fidelity, Privacy, and Beyond [53.56796220109518]
Different use cases demand synthetic data to comply with different requirements to be useful in practice. Four types of requirements are reviewed: utility of the synthetic data, alignment of the synthetic data with domain-specific knowledge, statistical fidelity of the synthetic data distribution compared to the real data distribution, and privacy-preserving capabilities. We discuss future directions for the field, along with opportunities to improve the current evaluation methods.
arXiv Detail & Related papers (2025-03-07T21:47:11Z)
Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice [0.3069335774032178]
It is crucial to empirically assess the privacy risks associated with the generated synthetic data before deploying generative technologies. This paper outlines the key concepts and assumptions underlying empirical privacy evaluation in machine learning-based generative and predictive models.
arXiv Detail & Related papers (2024-11-19T12:19:28Z)
Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights. Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data. Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
arXiv Detail & Related papers (2024-11-04T06:32:48Z)
A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples. Existing literature surveys each focus on only one specific data modality. We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
Towards Biologically Plausible and Private Gene Expression Data Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications. Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions. We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z)
A primer on synthetic health data [0.2770822269241974]
Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions without disclosing patient identity or sensitive information. However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity and predictive utility.
arXiv Detail & Related papers (2024-01-31T08:13:35Z)
Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data Generation and Evaluation in Learning Analytics [0.412484724941528]
Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse. Synthetic data emerges as a potential remedy, offering robust privacy protection. Prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility.
arXiv Detail & Related papers (2024-01-12T20:27:55Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing. Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy-sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field. We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z)
Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data. In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
A Multifaceted Benchmarking of Synthetic Electronic Health Record Generation Models [15.165156674288623]
We introduce a generalizable benchmarking framework to appraise key characteristics of synthetic health data. Results show that there is a utility-privacy tradeoff for sharing synthetic EHR data.
arXiv Detail & Related papers (2022-08-02T03:44:45Z)
DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation. It consists of a suite of evaluations to comprehensively reflect the generability and effectiveness of condensation methods. The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
Fidelity and Privacy of Synthetic Medical Data [0.0]
The digitization of medical records ushered in a new era of big data to clinical science. The need to share individual-level medical data continues to grow, and has never been more urgent. Enthusiasm for the use of big data has been tempered by a fully appropriate concern for patient autonomy and privacy.
arXiv Detail & Related papers (2021-01-18T23:01:27Z)
Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems. In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.