Achieving Reliable Causal Inference with Data-Mined Variables: A Random
Forest Approach to the Measurement Error Problem
- URL: http://arxiv.org/abs/2012.10790v1
- Date: Sat, 19 Dec 2020 21:48:23 GMT
- Title: Achieving Reliable Causal Inference with Data-Mined Variables: A Random
Forest Approach to the Measurement Error Problem
- Authors: Mochen Yang, Edward McFowland III, Gordon Burtch and Gediminas
Adomavicius
- Abstract summary: A common empirical strategy involves the application of predictive modeling techniques to 'mine' variables of interest from available data.
Recent work highlights that, because the predictions from machine learning models are inevitably imperfect, econometric analyses based on the predicted variables are likely to suffer from bias due to measurement error.
We propose a novel approach to mitigate these biases, leveraging the ensemble learning technique known as the random forest.
- Score: 1.5749416770494704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Combining machine learning with econometric analysis is becoming increasingly
prevalent in both research and practice. A common empirical strategy involves
the application of predictive modeling techniques to 'mine' variables of
interest from available data, followed by the inclusion of those variables into
an econometric framework, with the objective of estimating causal effects.
Recent work highlights that, because the predictions from machine learning
models are inevitably imperfect, econometric analyses based on the predicted
variables are likely to suffer from bias due to measurement error. We propose a
novel approach to mitigate these biases, leveraging the ensemble learning
technique known as the random forest. We propose employing random forest not
just for prediction, but also for generating instrumental variables to address
the measurement error embedded in the prediction. The random forest algorithm
performs best when comprised of a set of trees that are individually accurate
in their predictions, yet which also make 'different' mistakes, i.e., have
weakly correlated prediction errors. A key observation is that these properties
are closely related to the relevance and exclusion requirements of valid
instrumental variables. We design a data-driven procedure to select tuples of
individual trees from a random forest, in which one tree serves as the
endogenous covariate and the other trees serve as its instruments. Simulation
experiments demonstrate the efficacy of the proposed approach in mitigating
estimation biases and its superior performance over three alternative methods
for bias correction.
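The core idea of the abstract can be sketched in a few lines: train a random forest to "mine" a latent variable, use one tree's prediction as the endogenous covariate, and use another tree's prediction as its instrument in two-stage least squares. This is a minimal illustration under an assumed simulated data-generating process; the paper's actual data-driven procedure for selecting tuples of trees is omitted here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
coef = np.array([1.0, -0.5, 0.3, 0.0, 0.2])
beta = 2.0  # true causal effect in the simulated model

def simulate(n):
    X = rng.normal(size=(n, 5))
    x_true = X @ coef                        # latent variable of interest
    y = beta * x_true + rng.normal(size=n)   # econometric outcome
    return X, x_true, y

# "Mining" stage: the forest is trained on noisy labels of the latent variable.
X_tr, x_tr, _ = simulate(5000)
labels = x_tr + rng.normal(scale=0.5, size=5000)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, labels)

# Econometric stage on a fresh sample.
X_ev, _, y_ev = simulate(5000)

# One tree's prediction is the endogenous covariate; a second tree's
# prediction serves as its instrument (their errors are weakly correlated).
x_hat = rf.estimators_[0].predict(X_ev)
z = rf.estimators_[1].predict(X_ev)

def slope(x, y):
    # OLS slope with an intercept column.
    A = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(A, y, rcond=None)[0][1]

beta_naive = slope(x_hat, y_ev)  # attenuated by measurement error in x_hat

# Two-stage least squares: stage 1 projects x_hat onto the instrument,
# stage 2 regresses y on the stage-1 fitted values.
Z = np.column_stack([np.ones(len(z)), z])
stage1 = Z @ np.linalg.lstsq(Z, x_hat, rcond=None)[0]
beta_iv = slope(stage1, y_ev)
```

In this sketch both estimates are computed so the attenuation of the naive regression can be compared against the instrumented one; the tree indices 0 and 1 are an arbitrary illustrative choice, not the paper's selection rule.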
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
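As background for the entry above, a minimal univariate split-conformal sketch shows the basic construction that the semiparametric, correlation-aware method generalizes to vector-valued scores. The model and data here are simulated assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(3000, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=3000)

# Split into proper training and calibration halves.
X_tr, y_tr = X[:1500], y[:1500]
X_cal, y_cal = X[1500:], y[1500:]

model = LinearRegression().fit(X_tr, y_tr)

# Conformity scores on the calibration set: absolute residuals.
scores = np.abs(y_cal - model.predict(X_cal))
alpha = 0.1
level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q = np.quantile(scores, level)

# Marginal (1 - alpha) prediction interval for a new point.
x_new = np.array([[5.0]])
pred = model.predict(x_new)[0]
interval = (pred - q, pred + q)
```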
arXiv Detail & Related papers (2024-11-04T14:29:02Z)
- Building Trees for Probabilistic Prediction via Scoring Rules [0.0]
We study modifying a tree to produce nonparametric predictive distributions.
We find the standard method for building trees may not result in good predictive distributions.
We propose changing the splitting criteria for trees to one based on proper scoring rules.
arXiv Detail & Related papers (2024-02-16T20:04:13Z)
- Inference with Mondrian Random Forests [6.97762648094816]
We give precise bias and variance characterizations, along with a Berry-Esseen-type central limit theorem, for the Mondrian random forest regression estimator.
We present valid statistical inference methods for the unknown regression function.
Efficient and implementable algorithms are devised for both batch and online learning settings.
arXiv Detail & Related papers (2023-10-15T01:41:42Z)
- Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important for forecasting nonstationary processes or complex mixtures of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems.
It is proved that this structured model can efficiently interpolate this tessellation and approximate the multiple hypotheses target distribution.
arXiv Detail & Related papers (2023-09-02T01:27:53Z)
- Prediction-Powered Inference [68.97619568620709]
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system.
The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients.
Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning.
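A minimal instance of the framework described above is prediction-powered mean estimation: the mean prediction on a large unlabeled set is debiased by a "rectifier" computed on a small labeled set. The data and the stand-in predictor below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean = 3.0

# Small labeled sample and large unlabeled sample from the same population.
y_lab = rng.normal(true_mean, 1.0, size=200)
y_unlab = rng.normal(true_mean, 1.0, size=20000)

def predict(y):
    # Stand-in for an ML model: systematically biased, deterministic predictions.
    return 0.9 * y + 0.5

# Classical estimate uses only the labeled data.
theta_classical = y_lab.mean()

# Prediction-powered estimate: mean prediction on unlabeled data, plus a
# rectifier that corrects the model's bias using the labeled data.
rectifier = (y_lab - predict(y_lab)).mean()
theta_pp = predict(y_unlab).mean() + rectifier
```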
arXiv Detail & Related papers (2023-01-23T18:59:28Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Variable selection with missing data in both covariates and outcomes: Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies.
Machine learning methods weaken parametric assumptions.
XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z)
- Double Robust Representation Learning for Counterfactual Prediction [68.78210173955001]
We propose a novel scalable method to learn double-robust representations for counterfactual predictions.
We make robust and efficient counterfactual predictions for both individual and average treatment effects.
The algorithm shows competitive performance with the state-of-the-art on real world and synthetic data.
arXiv Detail & Related papers (2020-10-15T16:39:26Z)
- Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtly spurious correlations in training data induced by non-causal variables for prediction.
We propose a conditional independence test based algorithm to separate causal variables, with a seed variable as a priori knowledge, and adopt them for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z)
- A Numerical Transform of Random Forest Regressors corrects Systematically-Biased Predictions [0.0]
We find a systematic bias in predictions from random forest models.
This bias is recapitulated in simple synthetic datasets.
We use the training data to define a numerical transformation that fully corrects it.
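The correction described above can be sketched as fitting a mapping from the forest's out-of-bag predictions back to the observed training targets and applying it to new predictions. Isotonic regression is an illustrative choice of transform here, not necessarily the paper's; the data are simulated.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(2000, 1))
y = X[:, 0] ** 3 + rng.normal(scale=1.0, size=2000)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# Out-of-bag predictions expose the systematic bias (shrinkage toward the
# mean near the extremes of the response).
oob = rf.oob_prediction_

# Learn a monotone transform from OOB predictions to observed targets.
transform = IsotonicRegression(out_of_bounds="clip").fit(oob, y)

# Apply the transform to correct raw predictions on new inputs.
X_test = np.array([[-2.5], [0.0], [2.5]])
raw = rf.predict(X_test)
corrected = transform.transform(raw)
```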
arXiv Detail & Related papers (2020-03-16T21:18:06Z)
- Fréchet random forests for metric space valued regression with non euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which can handle data for which input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.