Leveraging Public Data for Practical Private Query Release
- URL: http://arxiv.org/abs/2102.08598v1
- Date: Wed, 17 Feb 2021 06:19:34 GMT
- Title: Leveraging Public Data for Practical Private Query Release
- Authors: Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu
- Abstract summary: We present PMW^Pub, which -- unlike existing baselines -- leverages public data drawn from a related distribution as prior information.
We provide a theoretical analysis and an empirical evaluation on the American Community Survey (ACS) and ADULT datasets.
PMW^Pub scales well to high-dimensional data domains, where running many existing methods would be computationally infeasible.
- Score: 24.615338449313676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many statistical problems, incorporating priors can significantly improve
performance. However, the use of prior knowledge in differentially private
query release has remained underexplored, despite such priors commonly being
available in the form of public datasets, such as previous US Census releases.
With the goal of releasing statistics about a private dataset, we present
PMW^Pub, which -- unlike existing baselines -- leverages public data drawn from
a related distribution as prior information. We provide a theoretical analysis
and an empirical evaluation on the American Community Survey (ACS) and ADULT
datasets, which shows that our method outperforms state-of-the-art methods.
Furthermore, PMW^Pub scales well to high-dimensional data domains, where
running many existing methods would be computationally infeasible.
Related papers
- Do You Really Need Public Data? Surrogate Public Data for Differential Privacy on Tabular Data [10.1687640711587]
This work introduces the notion of "surrogate" public data, which consume no privacy loss budget and are constructed solely from publicly available schema or metadata.
We automate the process of generating surrogate public data with large language models (LLMs).
In particular, we propose two methods: direct record generation as CSV files, and automated structural causal model (SCM) construction for sampling records.
arXiv Detail & Related papers (2025-04-19T17:55:10Z)
- Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation [9.819636361032256]
Differentially Private Synthetic Data Generation is a key enabler of private and secure data sharing.
Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data.
We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting.
arXiv Detail & Related papers (2025-04-15T08:59:03Z)
- On Privately Estimating a Single Parameter [47.499748486548484]
We investigate differentially private estimators for individual parameters within larger parametric models.
By leveraging these private certificates, we provide computationally and statistically efficient mechanisms that release private statistics that are, at least in the sample size, essentially unimprovable.
We investigate the practicality of the algorithms both in simulated data and in real-world data from the American Community Survey and US Census, highlighting scenarios in which the new procedures are successful and identifying areas for future work.
arXiv Detail & Related papers (2025-03-21T15:57:12Z)
- Privacy for Free: Leveraging Local Differential Privacy Perturbed Data from Multiple Services [10.822843258077997]
Local Differential Privacy (LDP) has emerged as a widely adopted privacy-preserving technique in modern data analytics.
This paper proposes a framework for collecting and aggregating data based on perturbed information from multiple services.
arXiv Detail & Related papers (2025-03-11T11:10:03Z)
- Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models [73.94175015918059]
We propose a dataset-level membership inference method based on Self-Comparison.
Our method does not require access to ground-truth member data or non-member data in identical distribution.
arXiv Detail & Related papers (2024-10-16T23:05:59Z)
- Federated Prediction-Powered Inference from Decentralized Data [40.84399531998246]
Prediction-Powered Inference (PPI) has been proposed to ensure statistical validity despite the unreliability.
The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI.
arXiv Detail & Related papers (2024-09-03T09:14:18Z)
- Source-Free Domain-Invariant Performance Prediction [68.39031800809553]
We propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data.
Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability.
Our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.
arXiv Detail & Related papers (2024-08-05T03:18:58Z)
- Uncertainty Quantification of Data Shapley via Statistical Inference [20.35973700939768]
The emergence of data markets underscores the growing importance of data valuation.
Within the machine learning landscape, Data Shapley stands out as a widely embraced method for data valuation.
This paper establishes the relationship between Data Shapley and infinite-order U-statistics.
arXiv Detail & Related papers (2024-07-28T02:54:27Z)
- Synthetic Census Data Generation via Multidimensional Multiset Sum [7.900694093691988]
We provide tools to generate synthetic microdata solely from published Census statistics.
We show that our methods work well in practice, and we offer theoretical arguments to explain our performance.
arXiv Detail & Related papers (2024-04-15T19:06:37Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- The Impact of Differential Feature Under-reporting on Algorithmic Fairness [86.275300739926]
We present an analytically tractable model of differential feature under-reporting.
We then use it to characterize the impact of this kind of data bias on algorithmic fairness.
Our results show that, in real world data settings, under-reporting typically leads to increasing disparities.
arXiv Detail & Related papers (2024-01-16T19:16:22Z)
- Optimal Locally Private Nonparametric Classification with Public Data [2.631955426232593]
We investigate the problem of public data assisted non-interactive Local Differentially Private (LDP) learning with a focus on non-parametric classification.
Under the posterior drift assumption, we derive the mini-max optimal convergence rate with LDP constraint.
We present a novel approach, the locally differentially private classification tree, which attains the mini-max optimal convergence rate.
arXiv Detail & Related papers (2023-11-19T16:35:01Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
- On PAC Learning Halfspaces in Non-interactive Local Privacy Model with Public Unlabeled Data [18.820311737806456]
We study the problem of PAC learning halfspaces in the non-interactive local differential privacy model (NLDP).
We show that it is possible to achieve sample complexities that are only linear in the dimension and in other terms for both private and public data.
arXiv Detail & Related papers (2022-09-17T12:19:20Z)
- Post-processing of Differentially Private Data: A Fairness Perspective [53.29035917495491]
This paper shows that post-processing causes disparate impacts on individuals or groups.
It analyzes two critical settings: the release of differentially private datasets and the use of such private datasets for downstream decisions.
It proposes a novel post-processing mechanism that is (approximately) optimal under different fairness metrics.
arXiv Detail & Related papers (2022-01-24T02:45:03Z)