PRIVET: Privacy Metric Based on Extreme Value Theory
- URL: http://arxiv.org/abs/2510.24233v1
- Date: Tue, 28 Oct 2025 09:42:03 GMT
- Title: PRIVET: Privacy Metric Based on Extreme Value Theory
- Authors: Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane, Nicolas Bereux, Léo Planche, Guillaume Charpiat, Burak Yelmen, Flora Jay, Cyril Furtlehner
- Abstract summary: Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage. We propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample.
- Score: 8.447463478355845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods rely almost exclusively on global criteria to estimate the risk of privacy failure associated with a model, offering only quantitative, non-interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality and limited sample sizes, such as genetic data, and even under underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset-level and sample-level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations of existing computer vision embeddings in yielding perceptually meaningful distances when identifying near-duplicate samples.
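The abstract specifies only the ingredients (nearest-neighbor distances plus extreme value statistics), not the full procedure. A minimal peaks-over-threshold sketch of that combination might look as follows; the function name `privacy_scores`, the use of a held-out reference set for calibration, the `tail_frac` threshold, and the generalized Pareto fit are all illustrative assumptions, not the authors' algorithm.

```python
# Hedged sketch: EVT-based leak scores from nearest-neighbor distances.
# This is NOT the PRIVET algorithm, only an illustration of the idea.
import numpy as np
from scipy.stats import genpareto
from sklearn.neighbors import NearestNeighbors

def privacy_scores(train, reference, synthetic, tail_frac=0.1):
    """Score each synthetic sample by how improbably close it sits to the
    training set, under an extreme-value model of the lower distance tail
    calibrated on held-out real data (`reference`)."""
    nn = NearestNeighbors(n_neighbors=1).fit(train)
    d_ref = nn.kneighbors(reference)[0].ravel()   # calibration distances
    d_syn = nn.kneighbors(synthetic)[0].ravel()   # distances to score

    # Peaks-over-threshold on the *lower* tail: negate distances so that
    # unusually small distances become large threshold exceedances.
    u = np.quantile(-d_ref, 1.0 - tail_frac)
    exceedances = (-d_ref)[-d_ref > u] - u
    shape, _, scale = genpareto.fit(exceedances, floc=0.0)

    # Calibrated lower-tail probability P(distance <= d); small values flag
    # synthetic samples that are closer to training data than EVT predicts.
    z = -d_syn - u
    p = np.where(z > 0,
                 tail_frac * genpareto.sf(z, shape, loc=0.0, scale=scale),
                 1.0)   # not in the extreme tail: no leak evidence
    return p
```

In this toy version a score near 1 means the sample is no closer to the training set than held-out data typically is, while a score near 0 flags a potential near-duplicate.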
Related papers
- Challenges in Enabling Private Data Valuation [17.450381366291754]
Data valuation methods quantify how individual training examples contribute to a model's behavior. Valuation scores can reveal whether a person's data was included in training, whether it was unusually influential, or what sensitive patterns exist in proprietary datasets. Privacy is fundamentally in tension with valuation utility under differential privacy (DP).
arXiv Detail & Related papers (2026-02-27T22:21:14Z)
- Empirical Evaluation of Structured Synthetic Data Privacy Metrics: Novel experimental framework [34.56525983543448]
Synthetic data generation is gaining traction as a privacy-enhancing technology. The concept of data privacy remains elusive, making it challenging for practitioners to evaluate and benchmark the degree of privacy protection offered by synthetic data.
arXiv Detail & Related papers (2025-12-18T08:09:28Z)
- How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage. Differentially private synthetic data refers to synthetic data that preserves the overall trends of the source data.
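As a concrete toy instance of the idea, one classic way to generate differentially private synthetic data is to privatize a histogram with the Laplace mechanism and resample from it. The sketch below is a textbook construction, not the guide's recommended pipeline; `dp_histogram_synth` and its parameters are illustrative.

```python
# Toy eps-DP synthetic data: noise a categorical histogram, then resample.
# Textbook Laplace-mechanism construction, not the paper's pipeline.
import numpy as np

def dp_histogram_synth(data, categories, eps, n_synth, rng=None):
    """Release n_synth synthetic records from a Laplace-noised histogram.
    Adding/removing one record changes one count by 1, so sensitivity = 1
    and Laplace(1/eps) noise gives eps-DP."""
    rng = rng or np.random.default_rng(0)
    counts = np.array([(np.asarray(data) == c).sum() for c in categories],
                      dtype=float)
    noisy = counts + rng.laplace(0.0, 1.0 / eps, size=len(counts))
    probs = np.clip(noisy, 0.0, None)
    probs = probs / probs.sum()            # renormalize after clipping
    return rng.choice(categories, size=n_synth, p=probs)
```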
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
- Synth-MIA: A Testbed for Auditing Privacy Leakage in Tabular Data Synthesis [8.4361320391543]
Tabular generative models are often argued to preserve privacy by creating synthetic datasets that resemble training data. Membership Inference Attacks (MIAs) have recently emerged as a method for evaluating privacy leakage in synthetic data. We propose a unified, model-agnostic threat framework that deploys a collection of attacks to estimate the maximum empirical privacy leakage in synthetic datasets.
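One of the simplest attacks such a testbed might deploy is a distance-to-closest-record (DCR) test: training-set members should sit unusually close to the released synthetic data. The sketch below is a generic illustration of that idea, not the Synth-MIA implementation; the function name and the AUC-based summary are assumptions.

```python
# Illustrative distance-to-closest-record (DCR) membership inference attack.
# Generic sketch; not the Synth-MIA codebase.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import NearestNeighbors

def dcr_attack_auc(synthetic, members, non_members):
    """Attack AUC from NN-distance to the synthetic data: members of the
    training set should be closer than non-members. AUC near 0.5 suggests
    little leakage; AUC near 1.0 suggests strong leakage."""
    nn = NearestNeighbors(n_neighbors=1).fit(synthetic)
    d_mem = nn.kneighbors(members)[0].ravel()
    d_non = nn.kneighbors(non_members)[0].ravel()
    scores = np.concatenate([-d_mem, -d_non])   # closer => higher score
    labels = np.concatenate([np.ones(len(d_mem)), np.zeros(len(d_non))])
    return roc_auc_score(labels, scores)
```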
arXiv Detail & Related papers (2025-09-22T16:53:38Z)
- On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z)
- PASS: Private Attributes Protection with Stochastic Data Substitution [46.38957234350463]
Various studies have been proposed to protect private attributes by removing them from the data while maintaining the utility of the data for downstream tasks. PASS is designed to substitute the original sample with another one according to certain probabilities, and is trained with a novel loss function. The comprehensive evaluation of PASS on various datasets of different modalities, including facial images, human activity sensory signals, and voice recordings, substantiates PASS's effectiveness and generalizability.
arXiv Detail & Related papers (2025-06-08T22:48:07Z)
- A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage [77.83757117924995]
We propose a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information can be used to infer sensitive attributes like age or substance use history from sanitized data.
arXiv Detail & Related papers (2025-04-28T01:16:27Z)
- Private Estimation when Data and Privacy Demands are Correlated [5.755004576310333]
Differential Privacy is the current gold standard for ensuring privacy for statistical queries. We consider the problems of empirical mean estimation for univariate data and frequency estimation for categorical data. We establish theoretical performance guarantees for our proposed algorithms, under both PAC error and mean-squared error.
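For reference, the standard uniform-privacy baseline for the first of these problems is the Laplace mechanism applied to a clipped empirical mean. The sketch below shows that textbook baseline, not the paper's correlated-demand algorithm; the clipping range `[lo, hi]` is an assumed public bound.

```python
# Textbook eps-DP mean estimate via the Laplace mechanism (baseline only;
# not the paper's algorithm for correlated privacy demands).
import numpy as np

def private_mean(x, eps, lo, hi, rng=None):
    """eps-DP estimate of the mean of values clipped to the public
    range [lo, hi]; one record moves the mean by at most (hi-lo)/n."""
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    sensitivity = (hi - lo) / len(x)
    return x.mean() + rng.laplace(0.0, sensitivity / eps)
```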
arXiv Detail & Related papers (2024-07-15T22:46:02Z)
- The Data Minimization Principle in Machine Learning [61.17813282782266]
Data minimization aims to reduce the amount of data collected, processed or retained.
It has been endorsed by various global data protection regulations.
However, its practical implementation remains a challenge due to the lack of a rigorous formulation.
arXiv Detail & Related papers (2024-05-29T19:40:27Z)
- Simulation-based Bayesian Inference from Privacy Protected Data [0.0]
We propose simulation-based inference methods for privacy-protected datasets. We illustrate our methods on discrete time-series data under an infectious disease model and with ordinary linear regression models.
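The core trick in this line of work is to simulate the privacy mechanism along with the data-generating model, so the released noisy statistic can be matched directly. The toy rejection-ABC sketch below illustrates this for a Laplace-privatized sample mean; the prior, the unit-variance Gaussian model, and the noise scale are all assumptions for illustration, not the paper's setup.

```python
# Toy rejection-ABC: infer a mean mu from a Laplace-privatized sample mean.
# The privacy mechanism is simulated inside the forward model.
# Illustrative assumptions throughout; not the paper's models or methods.
import numpy as np

def abc_posterior(s_released, noise_scale, n, n_draws=200_000, tol=0.02,
                  rng=None):
    rng = rng or np.random.default_rng(0)
    mu = rng.uniform(-5.0, 5.0, n_draws)                 # prior draws
    sample_mean = rng.normal(mu, 1.0 / np.sqrt(n))       # simulate summary
    released = sample_mean + rng.laplace(0.0, noise_scale, n_draws)  # DP step
    accept = np.abs(released - s_released) < tol         # ABC accept step
    return mu[accept]                                    # approximate posterior
```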
arXiv Detail & Related papers (2023-10-19T14:34:17Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
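Density-based here means scoring candidates by how much more likely they are under the synthetic-data distribution than under the population. A hedged sketch of that density-ratio idea, with kernel density estimators standing in for whatever estimators the paper actually uses, might look like this:

```python
# Hedged density-ratio membership score in the spirit of DOMIAS:
# log p_synthetic(x) - log p_reference(x). KDEs are a stand-in choice.
from sklearn.neighbors import KernelDensity

def density_ratio_scores(synthetic, reference, candidates, bandwidth=0.5):
    """Higher score => the generator places more mass near the candidate
    than the population does, hinting at local overfitting/membership."""
    kde_syn = KernelDensity(bandwidth=bandwidth).fit(synthetic)
    kde_ref = KernelDensity(bandwidth=bandwidth).fit(reference)
    return (kde_syn.score_samples(candidates)
            - kde_ref.score_samples(candidates))
```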
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models struggle with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- On the Statistical Complexity of Estimation and Testing under Privacy Constraints [17.04261371990489]
We show how to characterize the power of a statistical test under differential privacy in a plug-and-play fashion.
We show that maintaining privacy results in a noticeable reduction in performance only when the level of privacy protection is very high.
Finally, we demonstrate that the DP-SGLD algorithm, a private convex solver, can be employed for maximum likelihood estimation with a high degree of confidence.
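DP-SGLD is, roughly, stochastic gradient Langevin dynamics with per-example gradient clipping, where the injected Gaussian noise simultaneously drives the Langevin sampler and supplies the Gaussian-mechanism noise needed for differential privacy. A single update step might be sketched as below; the noise calibration and privacy accounting are deliberately omitted, and the constants are assumptions.

```python
# Rough sketch of one DP-SGLD update: clip per-example gradients, average,
# inject Gaussian noise. Privacy accounting/noise calibration omitted.
import numpy as np

def dp_sgld_step(theta, per_example_grads, lr, clip, sigma, rng):
    """per_example_grads: (n, d) matrix of per-record gradients."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    g = clipped.mean(axis=0)                     # clipped mean gradient
    # One Gaussian draw serves both as Langevin noise and as the
    # Gaussian-mechanism noise; sigma must be set by a privacy accountant.
    noise = rng.normal(0.0, sigma * clip / len(per_example_grads),
                       size=theta.shape)
    return theta - lr * g + np.sqrt(2.0 * lr) * noise
```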
arXiv Detail & Related papers (2022-10-05T12:55:53Z)