Statistical properties and privacy guarantees of an original
distance-based fully synthetic data generation method
- URL: http://arxiv.org/abs/2310.06571v1
- Date: Tue, 10 Oct 2023 12:29:57 GMT
- Title: Statistical properties and privacy guarantees of an original
distance-based fully synthetic data generation method
- Authors: R\'emy Chapelle (CESP, EVDG), Bruno Falissard (CESP)
- Abstract summary: This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Introduction: The amount of data generated by original research is growing
exponentially. Publicly releasing them is recommended to comply with the Open
Science principles. However, data collected from human participants cannot be
released as-is without raising privacy concerns. Fully synthetic data represent
a promising answer to this challenge. This approach is explored by the French
Centre de Recherche en {\'E}pid{\'e}miologie et Sant{\'e} des Populations in
the form of a synthetic data generation framework based on Classification and
Regression Trees and an original distance-based filtering. The goal of this
work was to develop a refined version of this framework and to assess its
risk-utility profile with empirical and formal tools, including novel ones
developed for the purpose of this evaluation.Materials and Methods: Our
synthesis framework consists of four successive steps, each of which is
designed to prevent specific risks of disclosure. We assessed its performance
by applying two or more of these steps to a rich epidemiological dataset.
Privacy and utility metrics were computed for each of the resulting synthetic
datasets, which were further assessed using machine learning
approaches.Results: Computed metrics showed a satisfactory level of protection
against attribute disclosure attacks for each synthetic dataset, especially
when the full framework was used. Membership disclosure attacks were formally
prevented without significantly altering the data. Machine learning approaches
showed a low risk of success for simulated singling out and linkability
attacks. Distributional and inferential similarity with the original data were
high with all datasets.Discussion: This work showed the technical feasibility
of generating publicly releasable synthetic data using a multi-step framework.
Formal and empirical tools specifically developed for this demonstration are a
valuable contribution to this field. Further research should focus on the
extension and validation of these tools, in an effort to specify the intrinsic
qualities of alternative data synthesis methods.Conclusion: By successfully
assessing the quality of data produced using a novel multi-step synthetic data
generation framework, we showed the technical and conceptual soundness of the
Open-CESP initiative, which seems ripe for full-scale implementation.
Related papers
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
arXiv Detail & Related papers (2024-04-20T08:08:28Z) - Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A
Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models.
ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task.
This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z) - Synthetic data generation for a longitudinal cohort study -- Evaluation,
method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns.
A promising alternative is the generation of fully synthetic data.
In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z) - Delving into High-Quality Synthetic Face Occlusion Segmentation Datasets [83.749895930242]
We propose two techniques for producing high-quality naturalistic synthetic occluded faces.
We empirically show the effectiveness and robustness of both methods, even for unseen occlusions.
We present two high-resolution real-world occluded face datasets with fine-grained annotations, RealOcc and RealOcc-Wild.
arXiv Detail & Related papers (2022-05-12T17:03:57Z) - Natural Language Inference with Self-Attention for Veracity Assessment
of Pandemic Claims [54.93898455714295]
We first describe the construction of the novel PANACEA dataset consisting of heterogeneous claims on COVID-19.
We then propose novel techniques for automated veracity assessment based on Natural Language Inference.
arXiv Detail & Related papers (2022-05-05T12:11:31Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE)
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z) - Holdout-Based Fidelity and Privacy Assessment of Mixed-Type Synthetic
Data [0.0]
AI-based data synthesis has seen rapid progress over the last several years, and is increasingly recognized for its promise to enable privacy-respecting data sharing.
We introduce and demonstrate a holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions.
arXiv Detail & Related papers (2021-04-01T17:30:23Z) - Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z) - Partially Conditioned Generative Adversarial Networks [75.08725392017698]
Generative Adversarial Networks (GANs) let one synthesise artificial datasets by implicitly modelling the underlying probability distribution of a real-world training dataset.
With the introduction of Conditional GANs and their variants, these methods were extended to generating samples conditioned on ancillary information available for each sample within the dataset.
In this work, we argue that standard Conditional GANs are not suitable for such a task and propose a new Adversarial Network architecture and training strategy.
arXiv Detail & Related papers (2020-07-06T15:59:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.