Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic
Data
- URL: http://arxiv.org/abs/2104.00635v1
- Date: Thu, 1 Apr 2021 17:30:23 GMT
- Title: Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic
Data
- Authors: Michael Platzer and Thomas Reutterer
- Abstract summary: AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-based data synthesis has seen rapid progress over the last several years,
and is increasingly recognized for its promise to enable privacy-respecting
high-fidelity data sharing. However, adequately evaluating the quality of
generated synthetic datasets is still an open challenge. We introduce and
demonstrate a holdout-based empirical assessment framework for quantifying the
fidelity as well as the privacy risk of synthetic data solutions for mixed-type
tabular data. Measuring fidelity is based on statistical distances of
lower-dimensional marginal distributions, which provide a model-free and
easy-to-communicate empirical metric for the representativeness of a synthetic
dataset. Privacy risk is assessed by calculating the individual-level distance
to the closest record with respect to the training data. By showing that the
synthetic samples are just as close to the training data as to the holdout data,
we obtain strong evidence that the synthesizer has indeed learned to generalize
patterns and is independent of individual training records. We demonstrate the
presented framework for seven distinct synthetic data solutions across four
mixed-type datasets and compare these to more traditional statistical
disclosure techniques. The results highlight the need to systematically assess
both the fidelity and the privacy of this emerging class of synthetic data
generators.
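
Both assessment steps are simple enough to sketch directly. The following Python snippet is a minimal illustration rather than the authors' exact implementation: it assumes the mixed-type columns have already been discretized (for the marginal comparison) and numerically encoded and scaled (for the distance computation), and the function names, the choice of total variation distance, and the use of Euclidean distance to the closest record are illustrative assumptions.

    import numpy as np
    import pandas as pd
    from itertools import combinations

    def marginal_tvd(real, synthetic, cols):
        # Total variation distance between the empirical marginal distributions
        # of the given (discretized) columns in the real and the synthetic data.
        p = real.groupby(list(cols)).size() / len(real)
        q = synthetic.groupby(list(cols)).size() / len(synthetic)
        p, q = p.align(q, fill_value=0.0)
        return 0.5 * float(np.abs(p - q).sum())

    def fidelity_report(real, synthetic, max_order=2):
        # Average TVD over all k-way marginals up to max_order; lower is better.
        report = {}
        for k in range(1, max_order + 1):
            dists = [marginal_tvd(real, synthetic, c)
                     for c in combinations(real.columns, k)]
            report[f"avg {k}-way TVD"] = float(np.mean(dists))
        return report

    def distance_to_closest_record(query, reference):
        # For every query row, the Euclidean distance to its nearest reference
        # row. Assumes numerically encoded and scaled columns; a KD-tree or
        # approximate nearest-neighbour search is advisable for larger data.
        diffs = query[:, None, :] - reference[None, :, :]
        return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

    def holdout_privacy_check(synthetic, train, holdout):
        # Share of synthetic records lying closer to a training record than to
        # any holdout record. A value near 0.5 means the synthetic samples are
        # no closer to the training data than to unseen data, i.e. no evidence
        # of memorization of individual training records.
        d_train = distance_to_closest_record(synthetic, train)
        d_holdout = distance_to_closest_record(synthetic, holdout)
        return float(np.mean(d_train < d_holdout))

Under these assumptions, a low average TVD on the low-order marginals together with a closer-to-training share near one half would correspond to the paper's argument: the generator reproduces the statistical structure of the population without staying measurably closer to the individuals it was trained on.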
Related papers
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
Generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
arXiv Detail & Related papers (2024-04-20T08:08:28Z) - Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z) - SDFR: Synthetic Data for Face Recognition Competition [51.9134406629509]
Large-scale face recognition datasets are collected by crawling the Internet and without individuals' consent, raising legal, ethical, and privacy concerns.
Recently several works proposed generating synthetic face recognition datasets to mitigate concerns in web-crawled face recognition datasets.
This paper presents the summary of the Synthetic Data for Face Recognition (SDFR) Competition held in conjunction with the 18th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2024).
The SDFR competition was split into two tasks, allowing participants to train face recognition systems using new synthetic datasets and/or existing ones.
arXiv Detail & Related papers (2024-04-06T10:30:31Z) - Conformalised data synthesis with statistical quality guarantees [0.0]
Data synthesis is a promising technique to address the demand of data-hungry models.
But reliably assessing the quality of a 'synthesiser' model's output is an open research question.
We have designed a unique confident data synthesis algorithm that introduces statistical confidence guarantees.
arXiv Detail & Related papers (2023-12-14T14:44:08Z) - Boosting Data Analytics With Synthetic Volume Expansion [3.568650932986342]
This article explores the effectiveness of statistical methods on synthetic data and the privacy risks of synthetic data.
A key finding within this framework is the generational effect, which reveals that the error rate of statistical methods on synthetic data decreases with the addition of more synthetic data but may eventually rise or stabilize.
arXiv Detail & Related papers (2023-10-27T01:57:27Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Statistical properties and privacy guarantees of an original
distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic
Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z)