Causal Inference with Corrupted Data: Measurement Error, Missing Values,
Discretization, and Differential Privacy
- URL: http://arxiv.org/abs/2107.02780v6
- Date: Mon, 12 Feb 2024 16:33:09 GMT
- Title: Causal Inference with Corrupted Data: Measurement Error, Missing Values,
Discretization, and Differential Privacy
- Authors: Anish Agarwal and Rahul Singh
- Abstract summary: We formulate a semiparametric model of causal inference with high dimensional corrupted data.
We prove consistency and Gaussian approximation by finite sample arguments.
Our analysis provides nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics.
- Score: 6.944765747195337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The US Census Bureau will deliberately corrupt data sets derived from the
2020 US Census, enhancing the privacy of respondents while potentially reducing
the precision of economic analysis. To investigate whether this trade-off is
inevitable, we formulate a semiparametric model of causal inference with high
dimensional corrupted data. We propose a procedure for data cleaning,
estimation, and inference with data cleaning-adjusted confidence intervals. We
prove consistency and Gaussian approximation by finite sample arguments, with a
rate of $n^{-1/2}$ for semiparametric estimands that degrades gracefully for
nonparametric estimands. Our key assumption is that the true covariates are
approximately low rank, which we interpret as approximate repeated measurements
and empirically validate. Our analysis provides nonasymptotic theoretical
contributions to matrix completion, statistical learning, and semiparametric
statistics. Calibrated simulations verify the coverage of our data
cleaning-adjusted confidence intervals and demonstrate the relevance of our results for
Census-derived data.
Related papers
- Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning [53.42244686183879]
Conformal prediction provides model-agnostic and distribution-free uncertainty quantification.
Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data.
We propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning.
arXiv Detail & Related papers (2024-10-13T15:37:11Z)
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z)
- Split Conformal Prediction under Data Contamination [14.23965125128232]
We study the robustness of split conformal prediction in a data contamination setting.
We quantify the impact of corrupted data on the coverage and efficiency of the constructed sets.
We propose an adjustment in the classification setting which we call Contamination Robust Conformal Prediction.
arXiv Detail & Related papers (2024-07-10T14:33:28Z)
- Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
- Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point.
Unlike existing methods, the proposed one accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z)
- Differentially Private Linear Regression with Linked Data [3.9325957466009203]
Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees.
Recent work focuses on developing differentially private versions of individual statistical and machine learning tasks.
We present two differentially private algorithms for linear regression with linked data.
arXiv Detail & Related papers (2023-08-01T21:00:19Z)
- Conformal Prediction with Missing Values [19.18178194789968]
We first show that the marginal coverage guarantee of conformal prediction holds on imputed data for any missingness distribution.
We then show that a universally consistent quantile regression algorithm trained on the imputed data is Bayes optimal for the pinball risk.
arXiv Detail & Related papers (2023-06-05T09:28:03Z)
- Monotonicity and Double Descent in Uncertainty Estimation with Gaussian
Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions.
We prove that, by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension.
We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z)
- Conditional Feature Importance for Mixed Data [1.6114012813668934]
We develop a conditional predictive impact (CPI) framework with knockoff sampling.
We show that our proposed workflow controls type I error, achieves high power, and is in line with results given by other conditional FI measures.
Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
arXiv Detail & Related papers (2022-10-06T16:52:38Z)
- Predictive Data Calibration for Linear Correlation Significance Testing [0.0]
Pearson's correlation coefficient (PCC) is known to lack in both regards.
We propose a machine-learning-based predictive data calibration method.
arXiv Detail & Related papers (2022-08-15T09:19:06Z)
- Evaluating representations by the complexity of learning low-loss
predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)