Synthesizing Realistic Test Data without Breaking Privacy
- URL: http://arxiv.org/abs/2602.05833v1
- Date: Thu, 05 Feb 2026 16:22:01 GMT
- Title: Synthesizing Realistic Test Data without Breaking Privacy
- Authors: Laura Plein, Alexi Turcotte, Arina Hallemans, Andreas Zeller,
- Abstract summary: This paper explores the feasibility of producing a synthetic test dataset with the same statistical properties as the original one.<n>We use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data.<n>By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions are the original dataset.
- Score: 3.975269455219312
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: There is a need for synthetic training and test datasets that replicate statistical distributions of original datasets without compromising their confidentiality. A lot of research has been done in leveraging Generative Adversarial Networks (GANs) for synthetic data generation. However, the resulting models are either not accurate enough or are still vulnerable to membership inference attacks (MIA) or dataset reconstruction attacks since the original data has been leveraged in the training process. In this paper, we explore the feasibility of producing a synthetic test dataset with the same statistical properties as the original one, with only indirectly leveraging the original data in the generation process. The approach is inspired by GANs, with a generation step and a discrimination step. However, in our approach, we use a test generator (a fuzzer) to produce test data from an input specification, preserving constraints set by the original data; a discriminator model determines how close we are to the original data. By evolving samples and determining "good samples" with the discriminator, we can generate privacy-preserving data that follows the same statistical distributions are the original dataset, leading to a similar utility as the original data. We evaluated our approach on four datasets that have been used to evaluate the state-of-the-art techniques. Our experiments highlight the potential of our approach towards generating synthetic datasets that have high utility while preserving privacy.
Related papers
- Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on generalized method of moments.<n>We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable.
We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Your Data, My Model: Learning Who Really Helps in Federated Learning [47.0304843350031]
A key challenge is determining which peers are most beneficial for collaboration.<n>We propose a simple and privacy-preserving method to select relevant collaborators.<n>Our approach enables model-agnostic, data-driven peer selection for personalized federated learning.
arXiv Detail & Related papers (2024-09-03T17:12:21Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.<n>RAG systems may face severe privacy risks when retrieving private data.<n>We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - Does Differentially Private Synthetic Data Lead to Synthetic Discoveries? [1.9573380763700712]
The evaluation is conducted in terms of the tests' Type I and Type II errors.
A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at privacy budget levels of $epsilonleq 1$.
arXiv Detail & Related papers (2024-03-20T14:03:57Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms [17.562365686511818]
We present a novel framework named DP-PQD (differentially-private per-query decider) to detect if the query answers on the private and synthetic datasets are within a user-specified threshold of each other.
We give a suite of private algorithms for per-query deciders for count, sum, and median queries, analyze their properties, and evaluate them experimentally.
arXiv Detail & Related papers (2023-09-15T17:38:59Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Membership Inference Attacks against Synthetic Data through Overfitting
Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z) - FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that while generating synthetic data most GANs amplify bias present in the training data but by removing these bias inducing samples, GANs essentially focuses more on real informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z) - A Kernelised Stein Statistic for Assessing Implicit Generative Models [10.616967871198689]
We propose a principled procedure to assess the quality of a synthetic data generator.
The sample size from the synthetic data generator can be as large as desired, while the size of the observed data, which the generator aims to emulate is fixed.
arXiv Detail & Related papers (2022-05-31T23:40:21Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.