Measuring Utility and Privacy of Synthetic Genomic Data
- URL: http://arxiv.org/abs/2102.03314v1
- Date: Fri, 5 Feb 2021 17:41:01 GMT
- Title: Measuring Utility and Privacy of Synthetic Genomic Data
- Authors: Bristena Oprisanu and Georgi Ganev and Emiliano De Cristofaro
- Abstract summary: We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
- Score: 3.635321290763711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Genomic data provides researchers with an invaluable source of information to
advance progress in biomedical research, personalized medicine, and drug
development. At the same time, however, this data is extremely sensitive, which
makes data sharing, and consequently availability, problematic if not outright
impossible. As a result, organizations have begun to experiment with sharing
synthetic data, which should mirror the real data's salient characteristics,
without exposing it. In this paper, we provide the first evaluation of the
utility and the privacy protection of five state-of-the-art models for
generating synthetic genomic data.
First, we assess the performance of the synthetic data on a number of common
tasks, such as allele and population statistics as well as linkage
disequilibrium and principal component analysis. Then, we study the
susceptibility of the data to membership inference attacks, i.e., inferring
whether a target record was part of the data used to train the model producing
the synthetic dataset. Overall, there is no single approach for generating
synthetic genomic data that performs well across the board. We show how the
size and the nature of the training dataset matter, especially in the case of
generative models. While some combinations of datasets and models produce
synthetic data with distributions close to the real data, there often are
target data points that are vulnerable to membership inference. Our measurement
framework can be used by practitioners to assess the risks of deploying
synthetic genomic data in the wild, and will serve as a benchmark tool for
researchers and practitioners in the future.
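As an illustration of the allele-statistics comparison mentioned in the abstract, the following is a minimal sketch and not the paper's actual measurement framework. It assumes genotypes are encoded as alternate-allele counts (0, 1, or 2) per individual and SNP. The function names `allele_frequencies` and `mean_abs_frequency_gap`, and the toy data, are hypothetical.

```python
from typing import List

# Assumed encoding: rows are individuals, columns are SNPs,
# values are alternate-allele counts (0, 1, or 2) for diploid genomes.
Genotypes = List[List[int]]


def allele_frequencies(genotypes: Genotypes) -> List[float]:
    """Return the alternate-allele frequency of each SNP."""
    n_individuals = len(genotypes)
    n_snps = len(genotypes[0])
    # Each diploid individual carries two alleles per SNP.
    return [
        sum(row[j] for row in genotypes) / (2 * n_individuals)
        for j in range(n_snps)
    ]


def mean_abs_frequency_gap(real: Genotypes, synthetic: Genotypes) -> float:
    """Average absolute per-SNP difference in allele frequency."""
    real_af = allele_frequencies(real)
    synth_af = allele_frequencies(synthetic)
    return sum(abs(r - s) for r, s in zip(real_af, synth_af)) / len(real_af)


# Toy data for illustration only (4 individuals, 3 SNPs).
real = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [1, 2, 0]]
synthetic = [[0, 1, 2], [1, 1, 1], [2, 0, 0], [1, 1, 0]]

print(allele_frequencies(real))
print(mean_abs_frequency_gap(real, synthetic))
```

A gap near zero would suggest the synthetic data preserves per-SNP allele frequencies. It says nothing about privacy, which the abstract evaluates separately through membership inference.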
Related papers
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data [40.165159490379146]
We show that the rate of false-positive findings (type 1 error) will be unacceptably high, even when the estimates are unbiased.
Despite the use of a previously proposed correction factor, this problem persists for deep generative models.
arXiv Detail & Related papers (2023-12-13T02:04:41Z)
- Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score.
Our proposed random projection based framework generates synthetic data with the highest accuracy score and scales the fastest.
arXiv Detail & Related papers (2023-12-09T02:04:25Z)
- Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data.
A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize.
arXiv Detail & Related papers (2023-10-27T01:57:27Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Generating Realistic Synthetic Relational Data through Graph Variational Autoencoders [47.89542334125886]
We combine the variational autoencoder framework with graph neural networks to generate realistic synthetic relational databases.
The results indicate that real databases' structures are accurately preserved in the resulting synthetic datasets.
arXiv Detail & Related papers (2022-11-30T10:40:44Z)
- Synthetic Data in Human Analysis: A Survey [16.562921709882865]
This survey is intended for researchers and practitioners in the field of human analysis.
We conduct a survey that summarises current state-of-the-art methods and the main benefits of using synthetic data.
We also provide an overview of publicly available synthetic datasets and generation models.
arXiv Detail & Related papers (2022-08-19T07:32:34Z)
- Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.