Epistemic Parity: Reproducibility as an Evaluation Metric for
Differential Privacy
- URL: http://arxiv.org/abs/2208.12700v3
- Date: Wed, 31 May 2023 23:42:13 GMT
- Title: Epistemic Parity: Reproducibility as an Evaluation Metric for
Differential Privacy
- Authors: Lucas Rosenblatt, Bernease Herman, Anastasia Holovenko, Wonkwon Lee,
Joshua Loftus, Elizabeth McKinnie, Taras Rumezhak, Andrii Stadnik, Bill Howe,
Julia Stoyanovich
- Abstract summary: We propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks.
We measure the likelihood that published conclusions would change had the authors used synthetic data.
We advocate for a new class of mechanisms that favor stronger utility guarantees and offer privacy protection.
- Score: 9.755020926517291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differential privacy (DP) data synthesizers support public release of
sensitive information, offering theoretical guarantees for privacy but limited
evidence of utility in practical settings. Utility is typically measured as the
error on representative proxy tasks, such as descriptive statistics, accuracy
of trained classifiers, or performance over a query workload. The ability of
these results to generalize to practitioners' experience has been questioned in
a number of settings, including the U.S. Census. In this paper, we propose an
evaluation methodology for synthetic data that avoids assumptions about the
representativeness of proxy tasks, instead measuring the likelihood that
published conclusions would change had the authors used synthetic data, a
condition we call epistemic parity. Our methodology consists of reproducing
empirical conclusions of peer-reviewed papers on real, publicly available data,
then re-running these experiments a second time on DP synthetic data, and
comparing the results.
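In spirit, the check is mechanical: encode a published claim as a boolean test, verify it on the real data, then re-run the identical test on the synthetic data. The following is a minimal Python sketch of that workflow, under stated assumptions: the claim, the toy data, and the crude per-record Laplace perturbation standing in for a DP synthesizer are illustrative stand-ins, not the authors' benchmark or mechanisms.

```python
# Minimal sketch of an epistemic parity check for one quantitative claim.
# The data, claim, and "synthesizer" below are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def claim_holds(df: pd.DataFrame) -> bool:
    """Encode a published claim as a boolean test, e.g. 'the coefficient
    on income is positive and significant at the 5% level'."""
    fit = smf.ols("outcome ~ income + age", data=df).fit()
    return bool(fit.params["income"] > 0 and fit.pvalues["income"] < 0.05)

# Stand-in for the real, publicly available study data.
n = 2000
real = pd.DataFrame({"income": rng.normal(50, 10, n), "age": rng.normal(40, 12, n)})
real["outcome"] = 0.5 * real["income"] + 0.1 * real["age"] + rng.normal(0, 5, n)

# Step 1: reproduce the published conclusion on the real data.
assert claim_holds(real)

# Step 2: re-run the identical analysis on "synthetic" data. A crude per-record
# Laplace perturbation stands in for a real DP synthesizer here; it is NOT a
# proper DP mechanism and is used only to keep the sketch self-contained.
synthetic = real + rng.laplace(scale=2.0, size=real.shape)

# Step 3: epistemic parity for this claim holds if the conclusion is unchanged.
print("epistemic parity:", claim_holds(synthetic) == claim_holds(real))
```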
We instantiate our methodology over a benchmark of recent peer-reviewed
papers that analyze public datasets in the ICPSR repository. We model
quantitative claims computationally to automate the experimental workflow, and
model qualitative claims by reproducing visualizations and comparing the
results manually. We then generate DP synthetic datasets using multiple
state-of-the-art mechanisms, and estimate the likelihood that these conclusions
will hold. We find that state-of-the-art DP synthesizers are able to achieve
high epistemic parity for several papers in our benchmark. However, some
papers, and particularly some specific findings, are difficult to reproduce for
any of the synthesizers. We advocate for a new class of mechanisms that favor
stronger utility guarantees and offer privacy protection with a focus on
application-specific threat models and risk assessment.
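As a rough illustration of how such a likelihood could be estimated, the sketch below repeats the parity check from the earlier snippet over multiple draws from each mechanism. The `synthesize` callables are assumed wrappers around real DP synthesizers (e.g., MST or AIM); the names and interface are hypothetical.

```python
# Hypothetical helper: estimate how often a claim's verdict on synthetic data
# matches its verdict on the real data, per mechanism and over repeated draws.
def parity_rate(real, claim_holds, synthesize, n_draws=20):
    verdict = claim_holds(real)
    agree = sum(claim_holds(synthesize(real)) == verdict for _ in range(n_draws))
    return agree / n_draws

# Usage sketch: `mechanisms` maps names like "MST" or "AIM" to callables that
# fit a DP synthesizer on `real` and return one synthetic dataset per call.
# rates = {name: parity_rate(real, claim_holds, s) for name, s in mechanisms.items()}
```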
Related papers
- A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models [3.672850225066168]
Generative AI and large language models (LLMs) have opened up new avenues for producing synthetic data.
Despite the potential benefits, concerns regarding privacy leakage have surfaced.
We introduce SynEval, an open-source evaluation framework designed to assess the fidelity, utility, and privacy preservation of synthetically generated tabular data.
arXiv Detail & Related papers (2024-04-20T08:08:28Z)
- Towards Biologically Plausible and Private Gene Expression Data Generation [47.72947816788821]
Generative models trained with Differential Privacy (DP) are becoming increasingly prominent in the creation of synthetic data for downstream applications.
Existing literature, however, primarily focuses on basic benchmarking datasets and tends to report promising results only for elementary metrics and relatively simple data distributions.
We initiate a systematic analysis of how DP generative models perform in their natural application scenarios, specifically focusing on real-world gene expression data.
arXiv Detail & Related papers (2024-02-07T14:39:11Z)
- Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown [50.40020716418472]
This study conducts a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity.
Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated.
arXiv Detail & Related papers (2024-01-31T17:38:34Z)
- Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework.
By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
- SoK: Privacy-Preserving Data Synthesis [72.92263073534899]
This paper focuses on privacy-preserving data synthesis (PPDS) by providing a comprehensive overview, analysis, and discussion of the field.
We put forth a master recipe that unifies two prominent strands of research in PPDS: statistical methods and deep learning (DL)-based methods.
arXiv Detail & Related papers (2023-07-05T08:29:31Z)
- Perturbation-Assisted Sample Synthesis: A Novel Approach for Uncertainty Quantification [3.175239447683357]
This paper introduces a novel Perturbation-Assisted Inference (PAI) framework utilizing synthetic data generated by the Perturbation-Assisted Sample Synthesis (PASS) method.
The framework focuses on uncertainty quantification in complex data scenarios, particularly involving unstructured data.
We demonstrate the effectiveness of PAI in advancing uncertainty quantification in complex, data-driven tasks by applying it to diverse areas such as image synthesis, sentiment word analysis, multimodal inference, and the construction of prediction intervals.
arXiv Detail & Related papers (2023-05-30T01:01:36Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Investigating Bias with a Synthetic Data Generator: Empirical Evidence and Philosophical Interpretation [66.64736150040093]
Machine learning applications are becoming increasingly pervasive in our society.
The risk is that they will systematically spread the bias embedded in the data.
We propose to analyze biases by introducing a framework for generating synthetic data with specific types of bias and their combinations.
arXiv Detail & Related papers (2022-09-13T11:18:50Z)
- CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z)
- Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
- An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises [4.129847064263057]
Differentially private (DP) synthetic datasets are a powerful approach for training machine learning models.
We study the effects of differentially private synthetic data generation on classification.
arXiv Detail & Related papers (2021-06-15T21:00:57Z)