Evaluating tree-based imputation methods as an alternative to MICE PMM
for drawing inference in empirical studies
- URL: http://arxiv.org/abs/2401.09602v1
- Date: Wed, 17 Jan 2024 21:28:00 GMT
- Title: Evaluating tree-based imputation methods as an alternative to MICE PMM
for drawing inference in empirical studies
- Authors: Jakob Schwerter, Ketevan Gurtskaia, Andrés Romero, Birgit
Zeyer-Gliozzo, Markus Pauly
- Abstract summary: Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures.
The prevailing method of Multiple Imputation by Chained Equations with Predictive Mean Matching (PMM) is considered standard in the social science literature.
In particular, tree-based imputation methods have emerged as very competitive approaches.
- Score: 0.5892638927736115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dealing with missing data is an important problem in statistical analysis
that is often addressed with imputation procedures. The performance and
validity of such methods are of great importance for their application in
empirical studies. While the prevailing method of Multiple Imputation by
Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered
standard in the social science literature, the increase in complex datasets may
require more advanced approaches based on machine learning. In particular,
tree-based imputation methods have emerged as very competitive approaches.
However, their performance and validity are not yet fully understood,
particularly in comparison with the standard MICE PMM. This is especially true for
inference in linear models. In this study, we investigate the impact of various
imputation methods on coefficient estimation, Type I error, and power, to gain
insights that can help empirical researchers deal with missingness more
effectively. We explore MICE PMM alongside different tree-based methods, such
as MICE with Random Forest (RF), Chained Random Forests with and without PMM
(missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic
simulation study using the German National Educational Panel Study (NEPS) as
the original data source. Our results reveal that Random Forest-based
imputations, especially MICE RF and missRanger with PMM, consistently perform
better in most scenarios. Standard MICE PMM shows partially increased bias and
overly conservative test decisions, particularly with non-true zero
coefficients. Our results thus underscore the potential advantages of
tree-based imputation methods, albeit with the caveat that all methods perform
worse as missingness increases, missRanger in particular.
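To make the PMM idea concrete, here is a minimal single-variable sketch in Python using only NumPy. This is an illustrative simplification, not the paper's implementation: MICE PMM (the R `mice` package) draws regression parameters from a Bayesian linear model and iterates over chained equations, whereas this sketch uses a plain least-squares fit for one incomplete variable; all variable names and the choice of k = 5 donors are assumptions for the example.

```python
import numpy as np

def pmm_impute(pred_obs, pred_mis, y_obs, k=5, rng=None):
    """Predictive mean matching: for each missing case, find the k observed
    cases whose predicted values are closest to the missing case's prediction,
    then draw the imputed value from one of those donors' observed values."""
    rng = np.random.default_rng() if rng is None else rng
    imputed = np.empty(len(pred_mis))
    for i, pred in enumerate(pred_mis):
        donors = np.argsort(np.abs(pred_obs - pred))[:k]  # k nearest donors
        imputed[i] = y_obs[rng.choice(donors)]            # copy a real value
    return imputed

# Toy data: y depends linearly on x, with ~30% of y missing at random.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)
missing = rng.random(200) < 0.3

x_obs, y_obs = x[~missing], y[~missing]
x_mis = x[missing]

# Step 1: fit a prediction model on the observed cases (least squares here;
# mice uses a Bayesian linear model, and tree-based variants such as
# missRanger with PMM replace this step with a random forest).
slope, intercept = np.polyfit(x_obs, y_obs, 1)
pred_obs = intercept + slope * x_obs
pred_mis = intercept + slope * x_mis

# Step 2: match each missing case to observed donors and draw imputed values.
y_imp = pmm_impute(pred_obs, pred_mis, y_obs, k=5, rng=rng)
```

Because every imputed value is copied from an observed donor, PMM can never produce implausible values outside the observed range, which is the property that makes it attractive as a default in the social sciences.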
Related papers
- A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models [63.949883238901414]
We present a unique angle of gradient analysis of loss functions that simultaneously reward good examples and penalize bad ones in LMs.
We find that ExMATE serves as a superior surrogate for MLE, and that combining DPO with ExMATE instead of MLE further enhances both the statistical (5-7%) and generative (+18% win rate) performance.
arXiv Detail & Related papers (2024-08-29T17:46:18Z)
- Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes [54.18828236350544]
Propensity score matching (PSM) addresses selection biases by selecting comparable populations for analysis.
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria.
To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches.
arXiv Detail & Related papers (2024-07-20T12:42:24Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Sparse high-dimensional linear mixed modeling with a partitioned empirical Bayes ECM algorithm [41.25603565852633]
This work presents an efficient and accurate Bayesian framework for high-dimensional LMMs.
The novelty of the approach lies in its partitioning and parameter expansion as well as its fast and scalable computation.
A real-world example is provided using data from a study of lupus in children, where we identify genes and clinical factors associated with a new lupus biomarker and predict the biomarker over time.
arXiv Detail & Related papers (2023-10-18T19:34:56Z)
- B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding [51.74479522965712]
We propose a meta-learner called the B-Learner, which can efficiently learn sharp bounds on the CATE function under limits on hidden confounding.
We prove its estimates are valid, sharp, efficient, and have a quasi-oracle property with respect to the constituent estimators under more general conditions than existing methods.
arXiv Detail & Related papers (2023-04-20T18:07:19Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Tradeoffs of Linear Mixed Models in Genome-wide Association Studies [18.560273425572582]
We study the statistical properties of linear mixed models (LMMs) applied to genome-wide association studies (GWAS).
First, we study the sensitivity of LMMs to the inclusion of a candidate SNP in the kinship matrix, which is often done in practice to speed up computations.
Second, we investigate how mixed models can correct confounders in GWAS, which is widely accepted as an advantage of LMMs over traditional methods.
arXiv Detail & Related papers (2021-11-05T22:05:59Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
- Variable selection with missing data in both covariates and outcomes: Imputation and machine learning [1.0333430439241666]
The missing data issue is ubiquitous in health studies.
Machine learning methods weaken parametric assumptions.
XGBoost and BART have the overall best performance across various settings.
arXiv Detail & Related papers (2021-04-06T20:18:29Z)
- Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison [5.994312110645453]
Multiple imputation (MI) is the state-of-the-art approach for dealing with missing data arising from non-response in sample surveys.
Recent MI methods based on deep learning models have been developed with encouraging results in small studies.
This paper provides a framework for using simulations based on real survey data and several performance metrics to compare MI methods.
arXiv Detail & Related papers (2021-03-14T16:24:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.