Confidence-Ranked Reconstruction of Census Microdata from Published
Statistics
- URL: http://arxiv.org/abs/2211.03128v2
- Date: Mon, 6 Feb 2023 17:32:02 GMT
- Title: Confidence-Ranked Reconstruction of Census Microdata from Published
Statistics
- Authors: Travis Dick, Cynthia Dwork, Michael Kearns, Terrance Liu, Aaron Roth,
Giuseppe Vietri, Zhiwei Steven Wu
- Abstract summary: A reconstruction attack on a private dataset takes as input some publicly accessible information about the dataset.
We show that our attacks can not only reconstruct full rows from the aggregate query statistics $Q(D)\in\mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data.
Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled.
- Score: 45.39928315344449
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A reconstruction attack on a private dataset $D$ takes as input some publicly
accessible information about the dataset and produces a list of candidate
elements of $D$. We introduce a new class of data reconstruction attacks based
on randomized methods for non-convex optimization. We empirically demonstrate
that our attacks can not only reconstruct full rows of $D$ from aggregate query
statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks
reconstructed rows by their odds of appearing in the private data, providing a
signature that could be used for prioritizing reconstructed rows for further
actions such as identity theft or hate crimes. We also design a sequence of
baselines for evaluating reconstruction attacks. Our attacks significantly
outperform those that are based only on access to a public distribution or
population from which the private dataset $D$ was sampled, demonstrating that
they are exploiting information in the aggregate statistics $Q(D)$, and not
simply the overall structure of the distribution. In other words, the queries
$Q(D)$ are permitting reconstruction of elements of this dataset, not the
distribution from which $D$ was drawn. These findings are established both on
2010 U.S. decennial Census data and queries, and on Census-derived American
Community Survey datasets. Taken together, our methods and experiments
illustrate the risks in releasing numerically precise aggregate statistics of a
large dataset, and provide further motivation for the careful application of
provably private techniques such as differential privacy.
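To make the risk concrete, below is a minimal, hypothetical sketch of a confidence-ranked reconstruction attack in the spirit of the abstract: many randomized restarts of a local search fit candidate datasets to the published aggregate statistics, and individual rows are then ranked by how often they recur across restarts. The two-way-marginal query class, the greedy single-row resampling, the vote-based confidence score, and every name below are illustrative assumptions, not the paper's actual algorithm, queries, or data.

```python
# Hypothetical, simplified illustration only: rows are tuples of small
# categorical attributes and the published statistics Q(D) are all 2-way
# marginal counts. The paper's own optimization and query workloads differ;
# every name here is an assumption made for this sketch.
import itertools
import random
from collections import Counter

DOMAIN = [4, 3, 4, 3]  # toy attribute cardinalities (144 possible rows)
ATTR_PAIRS = list(itertools.combinations(range(len(DOMAIN)), 2))


def two_way_marginals(rows):
    """Q(D): count of every (attribute pair, value pair) combination."""
    counts = Counter()
    for r in rows:
        for i, j in ATTR_PAIRS:
            counts[(i, j, r[i], r[j])] += 1
    return counts


def loss(candidate_rows, target_stats):
    """Squared error between the candidate's statistics and the published Q(D)."""
    cand = two_way_marginals(candidate_rows)
    keys = set(cand) | set(target_stats)
    return sum((cand[k] - target_stats[k]) ** 2 for k in keys)


def random_row(rng):
    return tuple(rng.randrange(c) for c in DOMAIN)


def local_search(target_stats, n_rows, rng, steps=1500):
    """One randomized restart: greedy single-row resampling, accepting non-worsening moves."""
    cand = [random_row(rng) for _ in range(n_rows)]
    cur = loss(cand, target_stats)
    for _ in range(steps):
        i = rng.randrange(n_rows)
        old = cand[i]
        cand[i] = random_row(rng)
        new = loss(cand, target_stats)
        if new <= cur:
            cur = new
        else:
            cand[i] = old  # revert a worsening move
    return cand


def reconstruction_attack(target_stats, n_rows, restarts=20, seed=0):
    """Rank candidate rows by how often they appear across randomized restarts."""
    votes = Counter()
    for t in range(restarts):
        rng = random.Random(seed + t)
        for row in local_search(target_stats, n_rows, rng):
            votes[row] += 1
    return votes.most_common()  # higher vote count ~ higher confidence the row is in D


if __name__ == "__main__":
    rng = random.Random(7)
    private_D = [random_row(rng) for _ in range(20)]  # stand-in for the private dataset D
    published = two_way_marginals(private_D)          # the released aggregate statistics Q(D)
    # Assume |D| is known to the attacker; it equals the total count of any one marginal.
    ranked = reconstruction_attack(published, n_rows=len(private_D))
    truth = set(private_D)
    top10 = [row for row, _ in ranked[:10]]
    print("top-10 ranked candidates actually in D:", sum(r in truth for r in top10))
```

A public-distribution baseline of the kind the abstract compares against would instead rank candidate rows by their estimated population frequency while ignoring $Q(D)$ entirely; the gap between the two rankings is what isolates the leakage attributable to the released statistics.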
Related papers
- On Differentially Private U Statistics [25.683071759227293]
We propose a new thresholding-based approach using local Hájek projections to reweight different subsets of the data.
This leads to nearly optimal private error for non-degenerate U-statistics and a strong indication of near-optimality for degenerate U-statistics.
arXiv Detail & Related papers (2024-07-06T03:27:14Z)
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z)
- Membership Inference Attacks against Synthetic Data through Overfitting
Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model (a minimal density-ratio sketch of this idea appears after the list below).
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Generating Data to Mitigate Spurious Correlations in Natural Language
Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z)
- A Statistical Learning View of Simple Kriging [0.0]
We analyze the simple Kriging task from a statistical learning perspective.
The goal is to predict the unknown values the underlying random field takes at any other location with minimum quadratic risk.
We prove non-asymptotic bounds of order $O_{\mathbb{P}}(1/\sqrt{n})$ for the excess risk of a plug-in predictive rule mimicking the true minimizer.
arXiv Detail & Related papers (2022-02-15T12:46:43Z)
- Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z)
- Public Data-Assisted Mirror Descent for Private Model Training [23.717811604829148]
We revisit the problem of using public data to improve the privacy/utility tradeoffs for differentially private (DP) model training.
We show that our algorithm not only significantly improves over traditional DP-SGD and DP-FedAvg, but also improves over DP-SGD and DP-FedAvg on models that have been pre-trained with the public data.
arXiv Detail & Related papers (2021-12-01T00:21:40Z)
- Strongly universally consistent nonparametric regression and
classification with privatised data [2.879036956042183]
We revisit the classical problem of nonparametric regression, but impose local differential privacy constraints.
We design a novel estimator of the regression function, which can be viewed as a privatised version of the well-studied partitioning regression estimator.
arXiv Detail & Related papers (2020-10-31T09:00:43Z)
- Identifying Statistical Bias in Dataset Replication [102.92137353938388]
We study a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy.
After correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for.
arXiv Detail & Related papers (2020-05-19T17:48:32Z)
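As a companion to the DOMIAS entry above, the following is a minimal, hypothetical sketch of the density-ratio idea its summary describes: a target record is scored by how much likelier it is under a density fitted to the synthetic data than under a density fitted to a public reference sample, so local overfitting of the generative model produces high scores for training members. The kernel-density estimators, the deliberately overfit toy "generator", and all names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical density-ratio membership score: log p_synth(x) - log p_ref(x).
# High values suggest the generator is locally overfit around x, i.e. x was
# likely a training member. Names and the KDE choice are for illustration only.
import numpy as np
from scipy.stats import gaussian_kde


def membership_scores(synthetic, reference, targets):
    """Score each target record by log p_synth(x) - log p_ref(x)."""
    p_synth = gaussian_kde(synthetic.T)  # density fitted to synthetic data
    p_ref = gaussian_kde(reference.T)    # density fitted to public reference data
    eps = 1e-12                          # avoid log(0)
    return np.log(p_synth(targets.T) + eps) - np.log(p_ref(targets.T) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    population = rng.normal(size=(5000, 2))   # stand-in for the true distribution
    train = population[:200]                  # records used to fit a "generator"
    # A deliberately overfit toy "generator": resample training points with small noise.
    synthetic = train[rng.integers(0, len(train), 2000)] + 0.05 * rng.normal(size=(2000, 2))
    reference = population[3000:]             # attacker's knowledge of the population
    members, non_members = train, population[2000:2200]
    s_in = membership_scores(synthetic, reference, members)
    s_out = membership_scores(synthetic, reference, non_members)
    print("mean score, members vs non-members:", s_in.mean(), s_out.mean())
```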