On the Usefulness of Synthetic Tabular Data Generation
- URL: http://arxiv.org/abs/2306.15636v1
- Date: Tue, 27 Jun 2023 17:26:23 GMT
- Title: On the Usefulness of Synthetic Tabular Data Generation
- Authors: Dionysis Manousakas and Sergül Aydöre
- Abstract summary: It is commonly believed that synthetic data can be used for both data exchange and boosting machine learning (ML) training.
Privacy-preserving synthetic data generation can accelerate data exchange for downstream tasks, but there is not enough evidence to show how or why synthetic data can boost ML training.
- Score: 3.04585143845864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in synthetic data generation, the scientific
community still lacks a unified consensus on its usefulness. It is commonly
believed that synthetic data can be used for both data exchange and boosting
machine learning (ML) training. Privacy-preserving synthetic data generation
can accelerate data exchange for downstream tasks, but there is not enough
evidence to show how or why synthetic data can boost ML training. In this
study, we benchmarked ML performance using synthetic tabular data for four use
cases: data sharing, data augmentation, class balancing, and data
summarization. We observed marginal improvements for the balancing use case on
some datasets. However, we conclude that there is not enough evidence to claim
that synthetic tabular data is useful for ML training.
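The class-balancing use case can be sketched in miniature with a stand-in generator: a per-class Gaussian fit to the minority class (a deliberately simple proxy for a real tabular synthesizer such as a GAN or copula model), used to top up the minority class before training. The dataset, generator, and classifier below are illustrative assumptions, not the benchmark's actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Imbalanced toy dataset (roughly 90% / 10% class split).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in "synthesizer": a single Gaussian fit to the minority class.
minority = X_tr[y_tr == 1]
mu, cov = minority.mean(axis=0), np.cov(minority, rowvar=False)
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
X_syn = rng.multivariate_normal(mu, cov, size=n_extra)

# Train on real-only data vs. real data topped up to a 50/50 class split.
clf_real = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
clf_bal = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_tr, X_syn]),
    np.concatenate([y_tr, np.ones(n_extra, dtype=int)]),
)
score_real = balanced_accuracy_score(y_te, clf_real.predict(X_te))
score_bal = balanced_accuracy_score(y_te, clf_bal.predict(X_te))
print(f"real-only: {score_real:.3f}  synthetic-balanced: {score_bal:.3f}")
```

On some datasets this kind of top-up yields a marginal gain in balanced accuracy, on others none, which is consistent with the mixed balancing results reported above.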
Related papers
- Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality training data for large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- On the Equivalency, Substitutability, and Flexibility of Synthetic Data [9.459709213597707]
We investigate the equivalency of synthetic data to real-world data, the substitutability of synthetic data for real data, and the flexibility of synthetic data generators.
Our results suggest that synthetic data not only enhances model performance but can also substitute for real data, with 60% to 80% of the real data replaceable without performance loss.
arXiv Detail & Related papers (2024-03-24T17:21:32Z)
- Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative for training machine learning models, but ensuring that it mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
- Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data [9.061271587514215]
We show how this assumption can be removed, allowing for MIAs to be performed using only the synthetic data.
Our results show that MIAs are still successful, across two real-world datasets and two synthetic data generators.
arXiv Detail & Related papers (2023-07-04T13:16:03Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
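The DGE idea can be approximated in miniature: train several generative models with different random seeds, fit a downstream classifier on each model's synthetic sample, and average the predictions, mimicking integration over the posterior on generator parameters. The Gaussian-mixture generators and logistic-regression learner below are illustrative stand-ins for the paper's deep generative models, not its actual setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

K = 5  # ensemble size: number of independently trained generators
probs = []
for seed in range(K):
    # One ensemble member: a per-class generative model with its own seed.
    Xs_parts, ys_parts = [], []
    for c in (0, 1):
        gm = GaussianMixture(n_components=2, random_state=seed)
        gm.fit(X_tr[y_tr == c])
        Xs_c, _ = gm.sample(400)  # draw a synthetic sample for class c
        Xs_parts.append(Xs_c)
        ys_parts.append(np.full(400, c))
    Xs, ys = np.vstack(Xs_parts), np.concatenate(ys_parts)

    # Downstream model trained purely on this member's synthetic data.
    clf = LogisticRegression(max_iter=1000).fit(Xs, ys)
    probs.append(clf.predict_proba(X_te)[:, 1])

# Average predicted probabilities across the ensemble.
y_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
acc = accuracy_score(y_te, y_pred)
print(f"DGE-style ensemble accuracy: {acc:.3f}")
```

Averaging over generators hedges against any single generator's idiosyncrasies, which is the intuition behind approximating the posterior with an ensemble.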
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Synthcity: facilitating innovative use cases of synthetic data in different data modalities [86.52703093858631]
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation.
Synthcity provides practitioners with a single access point to cutting-edge research and tools in synthetic data.
arXiv Detail & Related papers (2023-01-18T14:49:54Z)
- Evolving GANs: When Contradictions Turn into Compliance [11.353579556329962]
We propose a GAN game that improves discriminator accuracy in limited-data settings while generating realistic synthetic data.
An added advantage is that the generated data can be reused for other, similar tasks.
arXiv Detail & Related papers (2021-06-18T06:51:35Z)
- Foundations of Bayesian Learning from Synthetic Data [1.6249267147413522]
We use a Bayesian paradigm to characterise the updating of model parameters when learning on synthetic data.
Recent results from general Bayesian updating support a novel and robust approach to learning from synthetic data founded on decision theory.
arXiv Detail & Related papers (2020-11-16T21:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.