Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results
- URL: http://arxiv.org/abs/2305.07685v1
- Date: Fri, 12 May 2023 13:13:55 GMT
- Title: Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results
- Authors: Lisa Kühnel, Julian Schneider, Ines Perrar, Tim Adams, Fabian
Prasser, Ute Nöthlings, Holger Fröhlich, Juliane Fluck
- Abstract summary: In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
- Score: 0.32593385688760446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Access to individual-level health data is essential for gaining new insights
and advancing science. In particular, modern methods based on artificial
intelligence rely on the availability of and access to large datasets. In the
health sector, access to individual-level data is often challenging due to
privacy concerns. A promising alternative is the generation of fully synthetic
data, i.e. data generated through a randomised process that have statistical
properties similar to those of the original data, but do not have a one-to-one
correspondence with the original individual-level records. In this study, we
use a state-of-the-art synthetic data generation method and perform in-depth
quality analyses of the generated data for a specific use case in the field of
nutrition. We demonstrate the need for careful analyses of synthetic data that
go beyond descriptive statistics and provide valuable insights into how to
realise the full potential of synthetic datasets. By extending the methods, but
also by thoroughly analysing the effects of sampling from a trained model, we
are able to largely reproduce significant real-world analysis results in the
chosen use case.
Related papers
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a single, specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data.
A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize.
arXiv Detail & Related papers (2023-10-27T01:57:27Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Synthetic Data in Human Analysis: A Survey [16.562921709882865]
This survey is intended for researchers and practitioners in the field of human analysis.
We conduct a survey that summarises current state-of-the-art methods and the main benefits of using synthetic data.
We also provide an overview of publicly available synthetic datasets and generation models.
arXiv Detail & Related papers (2022-08-19T07:32:34Z)
- Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z)
- Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.