Trading Off Scalability, Privacy, and Performance in Data Synthesis
- URL: http://arxiv.org/abs/2312.05436v1
- Date: Sat, 9 Dec 2023 02:04:25 GMT
- Title: Trading Off Scalability, Privacy, and Performance in Data Synthesis
- Authors: Xiao Ling, Tim Menzies, Christopher Hazard, Jack Shu, Jacob Beel
- Abstract summary: We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which yields the best overall score.
Our proposed random projection based framework generates synthetic data with the highest accuracy score and scales the fastest.
- Score: 11.698554876505446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data has recently been widely applied in the real world. One
typical example is the creation of synthetic data for privacy-sensitive
datasets. In this scenario, synthetic data substitutes for the real data that
contains private information and is used for public testing of machine
learning models. Another typical example is over-sampling of imbalanced data,
in which synthetic data is generated in the region of the minority samples to
balance the positive-to-negative ratio when training machine learning models.
In this study, we concentrate on the first example and introduce (a) the Howso
engine, and (b) our proposed random projection based synthetic data generation
framework. We evaluate these two algorithms on privacy preservation and
accuracy, and compare them to two state-of-the-art synthetic data generation
algorithms, DataSynthesizer and Synthetic Data Vault. We show that the
synthetic data generated by the Howso engine has good privacy and accuracy,
which yields the best overall score. On the other hand, our proposed random
projection based framework can generate synthetic data with the highest
accuracy score and scales the fastest.
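The abstract does not spell out the mechanics of the random projection framework, so the sketch below is only one plausible reading, not the authors' released implementation: recursively split the data along random projection directions until the leaves are small, then synthesize new rows by interpolating between real rows inside the same leaf. The helper names (`random_projection_split`, `synthesize`) and the `min_leaf` parameter are assumptions for illustration.

```python
# Hypothetical sketch, not the authors' released code: recursive random
# projection splits the numeric data into small leaf clusters, then synthetic
# rows are made by interpolating between real rows inside the same leaf.
import numpy as np


def random_projection_split(rows, rng, min_leaf=32):
    """Recursively split rows along random projection directions."""
    if len(rows) <= min_leaf:
        return [rows]
    direction = rng.normal(size=rows.shape[1])   # random projection axis
    scores = rows @ direction                    # 1-D projection of every row
    median = np.median(scores)
    left, right = rows[scores <= median], rows[scores > median]
    if len(left) == 0 or len(right) == 0:        # degenerate split: stop here
        return [rows]
    return (random_projection_split(left, rng, min_leaf)
            + random_projection_split(right, rng, min_leaf))


def synthesize(rows, n_samples, min_leaf=32, seed=0):
    """Generate n_samples synthetic rows by interpolating within leaves."""
    rng = np.random.default_rng(seed)
    leaves = random_projection_split(np.asarray(rows, dtype=float), rng, min_leaf)
    out = []
    for _ in range(n_samples):
        leaf = leaves[rng.integers(len(leaves))]
        a, b = leaf[rng.integers(len(leaf), size=2)]   # two real rows, same leaf
        t = rng.random()
        out.append(a + t * (b - a))                    # synthetic row between them
    return np.vstack(out)


if __name__ == "__main__":
    real = np.random.default_rng(1).normal(size=(1000, 8))  # stand-in dataset
    print(synthesize(real, n_samples=500).shape)             # (500, 8)
```

Likewise, the abstract names the evaluation axes (privacy preservation and accuracy) but not the exact metrics. A common stand-in, used here only as an assumption, is train-on-synthetic/test-on-real accuracy for utility and distance to the closest real record for privacy; the paper's actual metrics may differ.

```python
# Hypothetical evaluation sketch: the exact metrics behind "privacy
# preservation and accuracy" are not given in this abstract; two common
# proxies are train-on-synthetic/test-on-real accuracy (utility) and the
# median distance from each synthetic row to its closest real row (privacy).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.neighbors import NearestNeighbors


def utility_score(X_syn, y_syn, X_real_test, y_real_test):
    """Train on synthetic rows, score on held-out real rows."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_syn, y_syn)
    return accuracy_score(y_real_test, model.predict(X_real_test))


def privacy_score(X_syn, X_real_train):
    """Median distance from each synthetic row to its nearest real row."""
    nn = NearestNeighbors(n_neighbors=1).fit(X_real_train)
    distances, _ = nn.kneighbors(X_syn)
    return float(np.median(distances))
```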
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Strong statistical parity through fair synthetic data [0.0]
This paper explores the creation of synthetic data that embodies Fairness by Design.
A downstream model trained on such synthetic data provides fair predictions across all thresholds.
arXiv Detail & Related papers (2023-11-06T10:06:30Z)
- Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction, empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data.
A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize.
arXiv Detail & Related papers (2023-10-27T01:57:27Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models, but ensuring that it mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
- Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)