Trading Off Scalability, Privacy, and Performance in Data Synthesis
- URL: http://arxiv.org/abs/2312.05436v1
- Date: Sat, 9 Dec 2023 02:04:25 GMT
- Title: Trading Off Scalability, Privacy, and Performance in Data Synthesis
- Authors: Xiao Ling, Tim Menzies, Christopher Hazard, Jack Shu, Jacob Beel
- Abstract summary: We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by Howso engine has good privacy and accuracy, which results the best overall score.
Our proposed random projection based framework can generate synthetic data with highest accuracy score, and has the fastest scalability.
- Score: 11.698554876505446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data has been widely applied in the real world recently. One
typical example is the creation of synthetic data for privacy concerned
datasets. In this scenario, synthetic data substitute the real data which
contains the privacy information, and is used to public testing for machine
learning models. Another typical example is the unbalance data over-sampling
which the synthetic data is generated in the region of minority samples to
balance the positive and negative ratio when training the machine learning
models. In this study, we concentrate on the first example, and introduce (a)
the Howso engine, and (b) our proposed random projection based synthetic data
generation framework. We evaluate these two algorithms on the aspects of
privacy preservation and accuracy, and compare them to the two state-of-the-art
synthetic data generation algorithms DataSynthesizer and Synthetic Data Vault.
We show that the synthetic data generated by Howso engine has good privacy and
accuracy, which results the best overall score. On the other hand, our proposed
random projection based framework can generate synthetic data with highest
accuracy score, and has the fastest scalability.
Related papers
- Leveraging Programmatically Generated Synthetic Data for Differentially Private Diffusion Training [4.815212947276105]
Programmatically generated synthetic data has been used in differential private training for classification to avoid privacy leakage.
The model trained with synthetic data generates unrealistic random images, raising challenges to adapt synthetic data for generative models.
We propose DPSynGen, which leverages generated synthetic data in diffusion models to address this challenge.
arXiv Detail & Related papers (2024-12-13T04:22:23Z) - Second FRCSyn-onGoing: Winning Solutions and Post-Challenge Analysis to Improve Face Recognition with Synthetic Data [104.30479583607918]
2nd FRCSyn-onGoing challenge is based on the 2nd Face Recognition Challenge in the Era of Synthetic Data (FRCSyn), originally launched at CVPR 2024.
We focus on exploring the use of synthetic data both individually and in combination with real data to solve current challenges in face recognition.
arXiv Detail & Related papers (2024-12-02T11:12:01Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses only less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - Strong statistical parity through fair synthetic data [0.0]
This paper explores the creation of synthetic data that embodies Fairness by Design.
A downstream model trained on such synthetic data provides fair predictions across all thresholds.
arXiv Detail & Related papers (2023-11-06T10:06:30Z) - Assessment of Differentially Private Synthetic Data for Utility and
Fairness in End-to-End Machine Learning Pipelines for Tabular Data [3.555830838738963]
Differentially private (DP) synthetic data sets are a solution for sharing data while preserving the privacy of individual data providers.
We identify the most effective synthetic data generation techniques for training and evaluating machine learning models.
arXiv Detail & Related papers (2023-10-30T03:37:16Z) - Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data.
A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize.
arXiv Detail & Related papers (2023-10-27T01:57:27Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.