Related papers: Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy

URL: http://arxiv.org/abs/2107.02780v6
Date: Mon, 12 Feb 2024 16:33:09 GMT
Title: Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy
Authors: Anish Agarwal and Rahul Singh
Abstract summary: We formulate a semiparametric model of causal inference with high dimensional corrupted data. We prove consistency and Gaussian approximation by finite sample arguments. Our analysis provides nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics.
Score: 6.944765747195337
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The US Census Bureau will deliberately corrupt data sets derived from the 2020 US Census, enhancing the privacy of respondents while potentially reducing the precision of economic analysis. To investigate whether this trade-off is inevitable, we formulate a semiparametric model of causal inference with high dimensional corrupted data. We propose a procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove consistency and Gaussian approximation by finite sample arguments, with a rate of $n^{-1/2}$ for semiparametric estimands that degrades gracefully for nonparametric estimands. Our key assumption is that the true covariates are approximately low rank, which we interpret as approximate repeated measurements and empirically validate. Our analysis provides nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. Calibrated simulations verify the coverage of our data cleaning-adjusted confidence intervals and demonstrate the relevance of our results for Census-derived data.

Related papers

Optimal Debiased Inference on Privatized Data via Indirect Estimation and Parametric Bootstrap [12.65121513620053]
Existing usage of the parametric bootstrap on privatized data ignored or avoided handling the effect of clamping. We propose using the indirect inference method to estimate the parameter values consistently. Our framework produces confidence intervals with well-calibrated coverage and performs hypothesis testing with the correct type I error.
arXiv Detail & Related papers (2025-07-14T19:12:16Z)
DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets. Our framework, DUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining. Specifically, given the evaluated data utilities of some data subsets, DUPRE fits a Gaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z)
Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning [53.42244686183879]
Conformal prediction provides model-agnostic and distribution-free uncertainty quantification. Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data. We propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning.
arXiv Detail & Related papers (2024-10-13T15:37:11Z)
Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations. We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting. We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z)
Split Conformal Prediction under Data Contamination [14.23965125128232]
We study the robustness of split conformal prediction in a data contamination setting. We quantify the impact of corrupted data on the coverage and efficiency of the constructed sets. We propose an adjustment in the classification setting which we call Contamination Robust Conformal Prediction.
arXiv Detail & Related papers (2024-07-10T14:33:28Z)
Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information. We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z)
Selective Nonparametric Regression via Testing [54.20569354303575]
We develop an abstention procedure via testing the hypothesis on the value of the conditional variance at a given point. Unlike existing methods, the proposed one accounts not only for the value of the variance itself but also for the uncertainty of the corresponding variance predictor.
arXiv Detail & Related papers (2023-09-28T13:04:11Z)
Differentially Private Linear Regression with Linked Data [3.9325957466009203]
Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses on developing differentially private versions of individual statistical and machine learning tasks. We present two differentially private algorithms for linear regression with linked data.
arXiv Detail & Related papers (2023-08-01T21:00:19Z)
Conformal Prediction with Missing Values [19.18178194789968]
We first show that the marginal coverage guarantee of conformal prediction holds on imputed data for any missingness distribution. We then show that a universally consistent quantile regression algorithm trained on the imputed data is Bayes optimal for the pinball risk.
arXiv Detail & Related papers (2023-06-05T09:28:03Z)
Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions. We prove that by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension. We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z)
Conditional Feature Importance for Mixed Data [1.6114012813668934]
We develop a conditional predictive impact (CPI) framework with knockoff sampling. We show that our proposed workflow controls type I error, achieves high power and is in line with results given by other conditional FI measures. Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
arXiv Detail & Related papers (2022-10-06T16:52:38Z)
Predictive Data Calibration for Linear Correlation Significance Testing [0.0]
Pearson's correlation coefficient (PCC) is known to lack in both regards. We propose a machine-learning-based predictive data calibration method.
arXiv Detail & Related papers (2022-08-15T09:19:06Z)
Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task. We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
arXiv Detail & Related papers (2020-09-15T22:06:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.