SynDiffix: More accurate synthetic structured data
- URL: http://arxiv.org/abs/2311.09628v1
- Date: Thu, 16 Nov 2023 07:17:06 GMT
- Title: SynDiffix: More accurate synthetic structured data
- Authors: Paul Francis, Cristian Berneanu, Edon Gashi
- Abstract summary: This paper introduces SynDiffix, a mechanism for generating statistically accurate, anonymous synthetic data for structured data.
Compared to CTGAN, ML models trained on SynDiffix data are twice as accurate, marginals and column-pair data quality are one to two orders of magnitude more accurate, and execution time is two orders of magnitude faster.
- Score: 0.5461938536945723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces SynDiffix, a mechanism for generating statistically accurate, anonymous synthetic data for structured data. Recent open source and commercial systems use Generative Adversarial Networks or Transformed Auto Encoders to synthesize data, and achieve anonymity through overfitting-avoidance. By contrast, SynDiffix exploits traditional mechanisms of aggregation, noise addition, and suppression among others. Compared to CTGAN, ML models generated from SynDiffix are twice as accurate, marginal and column pairs data quality is one to two orders of magnitude more accurate, and execution time is two orders of magnitude faster. Compared to the best commercial product we measured (MostlyAI), ML model accuracy is comparable, marginal and pairs accuracy is 5 to 10 times better, and execution time is an order of magnitude faster. Similar to the other approaches, SynDiffix anonymization is very strong. This paper describes SynDiffix and compares its performance with other popular open source and commercial systems.
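The abstract attributes SynDiffix's gains to traditional anonymization mechanisms rather than deep generative models. As a rough illustration of how aggregation, noise addition, and suppression can interact in tabular synthesis, here is a minimal single-column Python sketch; the bucketing scheme, noise scale, and suppression threshold are illustrative assumptions, not SynDiffix's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_column(values, n_buckets=20, noise_sd=1.0, suppress_below=3):
    """Toy one-column synthesis: aggregate into buckets, perturb the
    per-bucket counts with noise, suppress rare buckets, then sample
    synthetic values uniformly within each surviving bucket."""
    edges = np.linspace(values.min(), values.max(), n_buckets + 1)
    counts, _ = np.histogram(values, bins=edges)

    # Noise addition: perturb each aggregate count.
    noisy = counts + rng.normal(0.0, noise_sd, size=counts.shape)

    # Suppression: drop buckets whose noisy count falls below a threshold,
    # so rare (potentially identifying) values never reach the output.
    noisy = np.where(noisy < suppress_below, 0.0, noisy)

    probs = noisy / noisy.sum()
    picks = rng.choice(n_buckets, size=len(values), p=probs)
    return rng.uniform(edges[picks], edges[picks + 1])

real = rng.lognormal(mean=10, sigma=0.5, size=5000)   # e.g. salaries
synthetic = synthesize_column(real)
print(round(real.mean()), round(synthetic.mean()))
```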
Related papers
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs [10.70881967278009]
We present SWE-Synth, a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets at the repository level.
Compared to manually curated datasets, our method scales with minimal human effort while preserving contextual richness and correctness.
Our results highlight the potential of synthetic, agent-generated data to advance the state of the art in APR and software engineering automation.
arXiv Detail & Related papers (2025-04-20T22:37:43Z) - Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
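As a rough picture of the concept-recombination idea, the sketch below builds a co-occurrence graph over extracted concepts and random-walks it to seed new prompts; the tokenization and walk policy are stand-in assumptions, not SynthLLM's actual graph algorithm.

```python
import itertools
import random
from collections import defaultdict

docs = [
    "gradient-descent convergence learning-rate",
    "convergence proofs smoothness",
    "learning-rate schedules smoothness",
]

# Concept graph: nodes are extracted concepts; edges connect concepts
# that co-occur in the same document (a crude stand-in for extraction).
graph = defaultdict(set)
for doc in docs:
    for a, b in itertools.combinations(doc.split(), 2):
        graph[a].add(b)
        graph[b].add(a)

def recombine(start, steps=3, seed=0):
    """Random walk that recombines concepts across documents."""
    rng, walk, node = random.Random(seed), [start], start
    for _ in range(steps):
        node = rng.choice(sorted(graph[node]))
        walk.append(node)
    return walk

# The recombined concepts seed a prompt for one synthetic example.
print("Write a question connecting:", ", ".join(recombine("convergence")))
```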
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data.
SPEED uses less than 1/10 of the GPT API calls yet outperforms the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z) - SAU: A Dual-Branch Network to Enhance Long-Tailed Recognition via Generative Models [9.340077455871736]
Long-tailed distributions in image recognition pose a considerable challenge due to the severe imbalance between a few dominant classes and the many rare ones.
Recently, large generative models have been used to create synthetic data for image classification.
We propose the use of synthetic data as a complement to long-tailed datasets to eliminate the impact of data imbalance.
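The complement-the-tail idea can be sketched as below, where a hypothetical `generate_synthetic` stands in for sampling from a large generative model and every class is topped up to the head-class count.

```python
from collections import Counter

labels = [0] * 900 + [1] * 80 + [2] * 20   # toy long-tailed label set

def generate_synthetic(cls, n):
    """Placeholder for sampling n images of class `cls` from a large
    generative model (e.g. a class-conditioned diffusion model)."""
    return [f"synthetic_{cls}_{i}" for i in range(n)]

counts = Counter(labels)
target = max(counts.values())       # top every class up to the head class
extra = {cls: generate_synthetic(cls, target - n)
         for cls, n in counts.items() if n < target}
print({cls: len(v) for cls, v in extra.items()})   # {1: 820, 2: 880}
```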
arXiv Detail & Related papers (2024-08-29T05:33:59Z) - Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
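A minimal sketch of the rule-based substitution half of this recipe: inject plausible errors into clean sentences to form (erroneous source, clean target) pairs. The rules below are illustrative, not the paper's actual rule set; the model-based generation and relabeling steps are only noted in comments.

```python
import random

rng = random.Random(0)

# Rule-based substitution: map correct tokens to common error forms.
RULES = {"their": "there", "a": "an", "is": "are"}

def corrupt(sentence, p=0.5):
    out = []
    for tok in sentence.split():
        out.append(RULES[tok] if tok in RULES and rng.random() < p else tok)
    return " ".join(out)

clean = "their answer is a good one"
pair = (corrupt(clean), clean)      # (source with errors, target correction)
print(pair)
# The paper pairs this with model-based generation of fluent contexts and
# then relabels noisy synthetic pairs to clean up labels; both steps are
# omitted in this sketch.
```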
arXiv Detail & Related papers (2024-06-25T10:49:56Z) - A Comparison of SynDiffix Multi-table versus Single-table Synthetic Data [0.7252027234425334]
SynDiffix is a new open-source tool for structured data synthesis.
It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity.
This paper compares SynDiffix with 15 other synthetic data techniques using the SDNIST analysis framework.
arXiv Detail & Related papers (2024-03-13T12:26:50Z) - Fast Dual-Regularized Autoencoder for Sparse Biological Data [65.268245109828]
We develop a shallow autoencoder for the dual neighborhood-regularized matrix completion problem.
We demonstrate the speed and accuracy advantage of our approach over the existing state-of-the-art in predicting drug-target interactions and drug-disease associations.
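A minimal NumPy sketch of the idea, assuming a linear one-hidden-layer autoencoder and simple inner-product neighborhood graphs on rows (drugs) and columns (targets); the paper's actual architecture, similarity measures, and loss weights will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy drug-target interaction matrix (1 = known interaction) with a
# mask of observed entries; the task is to score the unobserved ones.
n, m, k = 30, 20, 5
X = (rng.random((n, m)) < 0.15).astype(float)
M = (rng.random((n, m)) < 0.8).astype(float)        # observed-entry mask

def laplacian(S):
    """Unnormalized graph Laplacian of a similarity matrix."""
    np.fill_diagonal(S, 0.0)
    return np.diag(S.sum(axis=1)) - S

L_row = laplacian(X @ X.T)    # drug-drug neighborhood graph
L_col = laplacian(X.T @ X)    # target-target neighborhood graph

W_e = rng.normal(0, 0.1, (m, k))                    # encoder
W_d = rng.normal(0, 0.1, (k, m))                    # decoder
alpha, beta, lr = 0.01, 0.01, 1e-3

for _ in range(500):
    H = X @ W_e                   # latent row (drug) embeddings
    E = M * (H @ W_d - X)         # reconstruction error, observed only
    # Dual regularization: smooth embeddings over the row graph and
    # decoder weights over the column graph.
    g_e = 2 * X.T @ E @ W_d.T + 2 * alpha * X.T @ (L_row @ H)
    g_d = 2 * H.T @ E + 2 * beta * W_d @ L_col
    W_e -= lr * g_e
    W_d -= lr * g_d

scores = X @ W_e @ W_d            # completed interaction scores
print("mean abs error on observed entries:",
      float((np.abs(M * (scores - X)).sum() / M.sum()).round(3)))
```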
arXiv Detail & Related papers (2024-01-30T01:28:48Z) - Trading Off Scalability, Privacy, and Performance in Data Synthesis [11.698554876505446]
We introduce (a) the Howso engine, and (b) our proposed random projection based synthetic data generation framework.
We show that the synthetic data generated by the Howso engine has good privacy and accuracy, which results in the best overall score.
Our proposed random projection based framework generates synthetic data with the highest accuracy score and scales the fastest.
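One plausible instantiation of random-projection-based synthesis, sketched below: compress the data with a Johnson-Lindenstrauss-style projection, model and sample in the cheap low-dimensional space, and map samples back via the pseudo-inverse. This is a generic illustration, not the paper's exact framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real data: 2000 records, 50 correlated numeric columns.
d, k = 50, 10
X = rng.normal(size=(2000, d)) @ rng.normal(size=(d, d))

# Random projection: pairwise structure is approximately preserved
# in the k-dimensional image.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# Model and sample in the cheap projected space...
mu, cov = Y.mean(axis=0), np.cov(Y, rowvar=False)
Y_syn = rng.multivariate_normal(mu, cov, size=2000)

# ...then map synthetic points back with the projection's pseudo-inverse.
# Only the structure captured by the projection survives the round trip.
X_syn = Y_syn @ np.linalg.pinv(R)

print(X.shape, X_syn.shape)
```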
arXiv Detail & Related papers (2023-12-09T02:04:25Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
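The error-extrapolation loop can be caricatured as follows, with `llm_generate` and `train_small_model` as hypothetical placeholders for the large-model API and the small-model fine-tuning step.

```python
def llm_generate(prompt, n):
    """Hypothetical stand-in for a large-model API call returning n
    synthetic (text, label) pairs conditioned on the prompt."""
    return [(f"example for: {prompt} #{i}", "label") for i in range(n)]

def train_small_model(dataset):
    """Placeholder for fine-tuning the small model on the data."""
    return lambda text: "label"            # toy constant predictor

def misclassified(model, gold):
    return [(x, y) for x, y in gold if model(x) != y]

gold = [("a real labeled example", "other")] * 8    # tiny real dev set
data = llm_generate("initial task description", n=100)

for _ in range(3):
    model = train_small_model(data)
    errors = misclassified(model, gold)
    if not errors:
        break
    # Extrapolate the error pattern: ask the large model for more
    # examples resembling what the small model got wrong, shrinking
    # the synthetic-vs-real distribution gap round by round.
    prompt = "more data like: " + "; ".join(x for x, _ in errors)
    data += llm_generate(prompt, n=50)

print(len(data), "synthetic examples after error-driven rounds")
```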
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose CORE (COmmon RandOm REconstruction), which can be used to compress information transmitted between machines.
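The core trick, sketched under simplifying assumptions (Gaussian common randomness, a single round, one worker): both sides derive the same random vectors from a shared seed, so only the projection coefficients cross the wire, and rescaling makes the reconstruction unbiased. Per-step noise of this kind averages out over SGD iterations.

```python
import numpy as np

d, m = 1000, 200          # model dimension, transmitted coefficients

def common_vectors(seed):
    """Both sender and receiver regenerate the same random vectors from
    a shared seed, so the vectors themselves are never transmitted."""
    return np.random.default_rng(seed).normal(size=(m, d)) / np.sqrt(d)

g = np.random.default_rng(1).normal(size=d)    # worker's local gradient

# Worker: project the gradient onto the common vectors; send m floats
# instead of d.
coeffs = common_vectors(seed=42) @ g

# Server: rebuild the same vectors and form an unbiased estimate
# (E[v v^T] = I/d, hence the d/m rescaling).
g_hat = (d / m) * common_vectors(seed=42).T @ coeffs

cos = g @ g_hat / (np.linalg.norm(g) * np.linalg.norm(g_hat))
print(f"5x compression, cosine similarity {cos:.2f}")  # noisy but unbiased
```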
arXiv Detail & Related papers (2023-09-23T08:45:27Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
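A toy version of the ensemble idea, using a bootstrap-fitted Gaussian as a stand-in for a deep generative model: each ensemble member acts like a posterior draw over generative parameters, and downstream results are aggregated across members rather than trusted from a single synthetic dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=5.0, scale=2.0, size=200)    # small real dataset

def fit_generator(data, seed):
    """Toy 'generative model': fit a Gaussian on a bootstrap resample.
    Different seeds give different plausible generative parameters,
    mimicking draws from the posterior over model parameters."""
    r = np.random.default_rng(seed)
    boot = r.choice(data, size=len(data), replace=True)
    return boot.mean(), boot.std()

K = 10                                             # ensemble size
estimates = []
for seed in range(K):
    mu, sd = fit_generator(real, seed)
    synthetic = np.random.default_rng(100 + seed).normal(mu, sd, size=1000)
    # The downstream task (here: estimating the mean) runs per member.
    estimates.append(synthetic.mean())

# Averaging across the ensemble integrates over generative-model
# uncertainty instead of trusting a single synthetic dataset.
print("ensemble estimate:", np.mean(estimates).round(2),
      "+/-", np.std(estimates).round(2))
```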
arXiv Detail & Related papers (2023-05-16T07:30:29Z)