Generating High-quality Privacy-preserving Synthetic Data
- URL: http://arxiv.org/abs/2602.06390v1
- Date: Fri, 06 Feb 2026 05:03:49 GMT
- Title: Generating High-quality Privacy-preserving Synthetic Data
- Authors: David Yavo, Richard Khoury, Christophe Pere, Sadoune Ait Kaci Azzou
- Abstract summary: We study a model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder. We evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic tabular data enables sharing and analysis of sensitive records, but its practical deployment requires balancing distributional fidelity, downstream utility, and privacy protection. We study a simple, model-agnostic post-processing framework that can be applied on top of any synthetic data generator to improve this trade-off. First, a mode-patching step repairs categories that are missing or severely underrepresented in the synthetic data, while largely preserving learned dependencies. Second, a k-nearest-neighbor filter replaces synthetic records that lie too close to real data points, enforcing a minimum distance between real and synthetic samples. We instantiate this framework for two neural generative models for tabular data, a feed-forward generator and a variational autoencoder, and evaluate it on three public datasets covering credit card transactions, cardiovascular health, and census-based income. We assess marginal and joint distributional similarity, the performance of models trained on synthetic data and evaluated on real data, and several empirical privacy indicators, including nearest-neighbor distances and attribute inference attacks. With moderate thresholds between 0.2 and 0.35, the post-processing reduces divergence between real and synthetic categorical distributions by up to 36 percent and improves a combined measure of pairwise dependence preservation by 10 to 14 percent, while keeping downstream predictive performance within about 1 percent of the unprocessed baseline. At the same time, distance-based privacy indicators improve and the success rate of attribute inference attacks remains largely unchanged. These results provide practical guidance for selecting thresholds and applying post hoc repairs to improve the quality and empirical privacy of synthetic tabular data, while complementing approaches that provide formal differential privacy guarantees.
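The two post-processing steps described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the patching rule (`min_share`), the distance metric, and the choice to drop rather than resample filtered rows are all simplifying assumptions for the sake of a short example.

```python
import numpy as np

def mode_patch(real_col, synth_col, min_share=0.5, rng=None):
    """Repair categories that are missing or underrepresented in a
    synthetic categorical column.

    Illustrative rule: a category is patched when its synthetic share
    falls below `min_share` times its share in the real data; rows of
    over-represented categories are overwritten to restore it.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    synth_col = synth_col.copy()
    n = len(synth_col)
    real_cats, real_counts = np.unique(real_col, return_counts=True)
    real_share = real_counts / real_counts.sum()
    for cat, share in zip(real_cats, real_share):
        have = (synth_col == cat).sum()
        need = int(np.ceil(share * min_share * n)) - have
        if need > 0:
            # overwrite randomly chosen rows currently holding other categories
            donors = np.flatnonzero(synth_col != cat)
            idx = rng.choice(donors, size=min(need, len(donors)), replace=False)
            synth_col[idx] = cat
    return synth_col

def knn_filter(real, synth, min_dist=0.25):
    """Enforce a minimum distance between real and synthetic samples.

    Drops (rather than resamples) every synthetic row whose nearest
    real neighbour, in Euclidean distance, is closer than `min_dist`.
    """
    # pairwise distances: (n_synth, n_real); fine for small arrays
    d = np.linalg.norm(synth[:, None, :] - real[None, :, :], axis=-1)
    nearest = d.min(axis=1)
    return synth[nearest >= min_dist]
```

In practice the pairwise-distance step would use a spatial index (e.g. a k-d tree) rather than a dense distance matrix, and numeric columns would be scaled before distances are computed; the threshold range 0.2 to 0.35 reported above presumes such a normalized space.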
Related papers
- A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport [7.409483754602669]
Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. We propose a framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels. Results show that synthetic data do not inherently guarantee privacy and that there is no "one-size-fits-all" model.
arXiv Detail & Related papers (2025-10-28T12:52:47Z) - High-dimensional Analysis of Synthetic Data Selection [44.67519806837088]
We show that, for linear models, the covariance shift between the target distribution and the distribution of the synthetic data affects the generalization error. Remarkably, the theoretical insights from linear models carry over to deep neural networks and generative models.
arXiv Detail & Related papers (2025-10-09T12:06:31Z) - Valid Inference with Imperfect Synthetic Data [39.10587411316875]
We introduce a new estimator based on the generalized method of moments. We find that interactions between the moment residuals of synthetic data and those of real data can greatly improve estimates of the target parameter.
arXiv Detail & Related papers (2025-08-08T18:32:52Z) - Enabling PSO-Secure Synthetic Data Sharing Using Diversity-Aware Diffusion Models [7.202078342390581]
We propose a generalisable framework for training diffusion models on personal data. This leads to unpersonal synthetic datasets achieving performance within one percentage point of real-data models.
arXiv Detail & Related papers (2025-06-22T10:26:35Z) - Latent Noise Injection for Private and Statistically Aligned Synthetic Data Generation [7.240170769827935]
Synthetic data generation has become essential for scalable, privacy-preserving statistical analysis. We propose a Latent Noise Injection method using Masked Autoregressive Flows (MAF). Instead of directly sampling from the trained model, our method perturbs each data point in the latent space and maps it back to the data domain.
arXiv Detail & Related papers (2025-06-19T22:22:57Z) - Bounding Box-Guided Diffusion for Synthesizing Industrial Images and Segmentation Map [50.21082069320818]
We propose a novel diffusion-based pipeline for generating high-fidelity industrial datasets with minimal supervision. Our approach conditions the diffusion model on enriched bounding box representations to produce precise segmentation masks. Results demonstrate that diffusion-based synthesis can bridge the gap between artificial and real-world industrial data.
arXiv Detail & Related papers (2025-05-06T15:21:36Z) - Scaling Laws of Synthetic Data for Language Models [125.41600201811417]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z) - Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets [40.67424997797513]
Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance. Similarity-based methods aim at finding the level of similarity between training and synthetic data. Attack-based methods conduct deliberate attacks on synthetic datasets.
arXiv Detail & Related papers (2025-02-19T15:52:23Z) - Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z) - Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap.
Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z) - From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real.
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - A Linear Reconstruction Approach for Attribute Inference Attacks against Synthetic Data [1.5293427903448022]
We introduce a new attribute inference attack against synthetic data.
We show that our attack can be highly accurate even on arbitrary records.
We then evaluate the tradeoff between protecting privacy and preserving statistical utility.
arXiv Detail & Related papers (2023-01-24T14:56:36Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
arXiv Detail & Related papers (2021-04-01T17:30:23Z) - Differentially Private Federated Learning with Laplacian Smoothing [72.85272874099644]
Federated learning aims to protect data privacy by collaboratively learning a model without sharing private data among users.
An adversary may still be able to infer the private training data by attacking the released model.
Differential privacy provides a statistical protection against such attacks at the price of significantly degrading the accuracy or utility of the trained models.
arXiv Detail & Related papers (2020-05-01T04:28:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.