Sharing is CAIRing: Characterizing Principles and Assessing Properties
of Universal Privacy Evaluation for Synthetic Tabular Data
- URL: http://arxiv.org/abs/2312.12216v1
- Date: Tue, 19 Dec 2023 15:05:52 GMT
- Title: Sharing is CAIRing: Characterizing Principles and Assessing Properties
of Universal Privacy Evaluation for Synthetic Tabular Data
- Authors: Tobias Hyrup, Anton Danholt Lautrup, Arthur Zimek, Peter
Schneider-Kamp
- Abstract summary: We identify four principles for the assessment of metrics: Comparability, Applicability, Interpretability, and Representativeness (CAIR).
We study the applicability and usefulness of the CAIR principles and rubric by assessing a selection of metrics popular in other studies.
- Score: 3.67056030380617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data sharing is a necessity for innovative progress in many domains,
especially in healthcare. However, the ability to share data is hindered by
regulations protecting the privacy of natural persons. Synthetic tabular data
provide a promising solution to address data-sharing difficulties but do not
inherently guarantee privacy. Still, there is a lack of agreement on
appropriate methods for assessing the privacy-preserving capabilities of
synthetic data, making it difficult to compare results across studies. To the
best of our knowledge, this is the first work to identify properties that
constitute good universal privacy evaluation metrics for synthetic tabular
data. The goal of such metrics is to enable comparability across studies and to
allow non-technical stakeholders to understand how privacy is protected. We
identify four principles for the assessment of metrics: Comparability,
Applicability, Interpretability, and Representativeness (CAIR). To quantify and
rank the degree to which evaluation metrics conform to the CAIR principles, we
design a rubric using a scale of 1-4. Each of the four properties is scored on
four parameters, yielding 16 total dimensions. We study the applicability and
usefulness of the CAIR principles and rubric by assessing a selection of
metrics popular in other studies. The results provide granular insights into
the strengths and weaknesses of existing metrics that not only rank the metrics
but highlight areas of potential improvements. We expect that the CAIR
principles will foster agreement among researchers and organizations on which
universal privacy evaluation metrics are appropriate for synthetic tabular
data.
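The rubric described in the abstract can be pictured as a small data structure: four principles, each scored on four parameters using a 1-4 scale, for 16 dimensions in total. The following is a minimal Python sketch of such an assessment, assuming hypothetical parameter scores and a simple unweighted aggregate for ranking; the actual parameters and scoring procedure are defined in the paper itself, not in this abstract.

```python
from dataclasses import dataclass
from statistics import mean

# The four CAIR principles from the abstract; the four parameters under each
# principle are defined in the paper and are represented here only as
# positional scores.
PRINCIPLES = ["Comparability", "Applicability", "Interpretability", "Representativeness"]
PARAMS_PER_PRINCIPLE = 4
SCALE = range(1, 5)  # rubric scores are on a 1-4 scale


@dataclass
class RubricAssessment:
    """Scores for one privacy metric: 4 principles x 4 parameters = 16 dimensions."""
    metric_name: str
    scores: dict  # principle -> list of four scores, each in 1..4

    def validate(self) -> None:
        for principle in PRINCIPLES:
            values = self.scores.get(principle, [])
            if len(values) != PARAMS_PER_PRINCIPLE or not all(v in SCALE for v in values):
                raise ValueError(f"{principle}: expected {PARAMS_PER_PRINCIPLE} scores in 1-4")

    def principle_means(self) -> dict:
        return {p: mean(self.scores[p]) for p in PRINCIPLES}

    def overall(self) -> float:
        # Unweighted mean over all 16 dimensions, used here only to illustrate
        # ranking; the paper may aggregate differently.
        return mean(v for values in self.scores.values() for v in values)


# Example: assessing a hypothetical distance-to-closest-record metric.
assessment = RubricAssessment(
    metric_name="DCR",
    scores={
        "Comparability": [3, 4, 2, 3],
        "Applicability": [4, 4, 3, 4],
        "Interpretability": [2, 3, 3, 2],
        "Representativeness": [3, 2, 2, 3],
    },
)
assessment.validate()
print(assessment.principle_means(), round(assessment.overall(), 2))
```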
Related papers
- Contrastive Learning-Based privacy metrics in Tabular Synthetic Datasets [40.67424997797513]
Synthetic data has garnered attention as a Privacy Enhancing Technology (PET) in sectors such as healthcare and finance.
Similarity-based methods aim to quantify the similarity between training and synthetic data.
Attack-based methods conduct deliberate attacks on synthetic datasets.
arXiv Detail & Related papers (2025-02-19T15:52:23Z) - Synthetic Data Privacy Metrics [2.1213500139850017]
We review the pros and cons of popular metrics that include simulations of adversarial attacks.
We also review current best practices for amending generative models to enhance the privacy of the data they create.
arXiv Detail & Related papers (2025-01-07T17:02:33Z) - Defining 'Good': Evaluation Framework for Synthetic Smart Meter Data [14.779917834583577]
We show that standard privacy attack methods are inadequate for assessing privacy risks of smart meter datasets.
We propose an improved method by injecting training data with implausible outliers, then launching privacy attacks directly on these outliers.
arXiv Detail & Related papers (2024-07-16T14:41:27Z) - Scaling While Privacy Preserving: A Comprehensive Synthetic Tabular Data
Generation and Evaluation in Learning Analytics [0.412484724941528]
Privacy poses a significant obstacle to the progress of learning analytics (LA), presenting challenges like inadequate anonymization and data misuse.
Synthetic data emerges as a potential remedy, offering robust privacy protection.
Prior LA research on synthetic data lacks thorough evaluation, essential for assessing the delicate balance between privacy and data utility.
arXiv Detail & Related papers (2024-01-12T20:27:55Z) - OpinSummEval: Revisiting Automated Evaluation for Opinion Summarization [52.720711541731205]
We present OpinSummEval, a dataset comprising human judgments and outputs from 14 opinion summarization models.
Our findings indicate that metrics based on neural networks generally outperform non-neural ones.
arXiv Detail & Related papers (2023-10-27T13:09:54Z) - Theoretically Principled Federated Learning for Balancing Privacy and
Utility [61.03993520243198]
We propose a general learning framework for protection mechanisms that protect privacy by distorting model parameters.
It can achieve personalized utility-privacy trade-off for each model parameter, on each client, at each communication round in federated learning.
arXiv Detail & Related papers (2023-05-24T13:44:02Z) - Auditing and Generating Synthetic Data with Controllable Trust Trade-offs [54.262044436203965]
We introduce a holistic auditing framework that comprehensively evaluates synthetic datasets and AI models.
It focuses on preventing bias and discrimination, ensuring fidelity to the source data, and assessing utility, robustness, and privacy preservation.
We demonstrate the framework's effectiveness by auditing various generative models across diverse use cases.
arXiv Detail & Related papers (2023-04-21T09:03:18Z) - How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS) which allows one to apportion the subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z) - Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z) - Estimation of Fair Ranking Metrics with Incomplete Judgments [70.37717864975387]
We propose a sampling strategy and estimation technique for four fair ranking metrics.
We formulate a robust and unbiased estimator which can operate even with a very limited number of labeled items.
arXiv Detail & Related papers (2021-08-11T10:57:00Z) - Really Useful Synthetic Data -- A Framework to Evaluate the Quality of
Differentially Private Synthetic Data [2.538209532048867]
Recent advances in generating synthetic data that allow principled ways of protecting privacy are a crucial step toward sharing statistical information in a privacy-preserving way.
To further optimise the inherent trade-off between data privacy and data quality, it is necessary to think closely about the latter.
We develop a framework to evaluate the quality of differentially private synthetic data from an applied researcher's perspective.
arXiv Detail & Related papers (2020-04-16T16:24:22Z)