Adapting tree-based multiple imputation methods for multi-level data? A
simulation study
- URL: http://arxiv.org/abs/2401.14161v1
- Date: Thu, 25 Jan 2024 13:12:50 GMT
- Title: Adapting tree-based multiple imputation methods for multi-level data? A
simulation study
- Authors: Ketevan Gurtskaia, Jakob Schwerter and Philipp Doebler
- Abstract summary: This simulation study evaluates the effectiveness of multiple imputation techniques for multilevel data.
It compares the performance of traditional Multiple Imputation by Chained Equations (MICE) with tree-based methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This simulation study evaluates the effectiveness of multiple imputation (MI)
techniques for multilevel data. It compares the performance of traditional
Multiple Imputation by Chained Equations (MICE) with tree-based methods such as
Chained Random Forests with Predictive Mean Matching and Extreme Gradient
Boosting. Adapted versions that include dummy variables for cluster membership
are also included for the tree-based methods. Methods are evaluated for
coefficient estimation bias, statistical power, and type I error rates on
simulated hierarchical data with different cluster sizes (25 and 50) and levels
of missingness (10% and 50%). Coefficients are estimated using random
intercept and random slope models. The results show that while MICE is
preferred for accurate rejection rates, Extreme Gradient Boosting is
advantageous for reducing bias. Furthermore, the study finds that bias levels
are similar across different cluster sizes, but rejection rates tend to be less
favorable with fewer clusters (lower power, higher type I error). In addition,
the inclusion of cluster dummies in tree-based methods improves estimation for
Level 1 variables, but is less effective for Level 2 variables. When data
become too complex and MICE is too slow, Extreme Gradient Boosting is a good
alternative for hierarchical data.
Keywords: Multiple imputation; multi-level data; MICE; missRanger; mixgb
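The cluster-dummy adaptation described in the abstract can be sketched in Python. The study itself uses the R packages mice, missRanger and mixgb; scikit-learn's `IterativeImputer` with a random-forest estimator serves here as an analogous stand-in, and all data and variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy hierarchical data: 25 clusters, one Level-1 predictor, 10% MCAR missingness.
n_clusters, n_per = 25, 20
cluster = np.repeat(np.arange(n_clusters), n_per)
u = rng.normal(size=n_clusters)[cluster]              # random intercepts
x = rng.normal(size=cluster.size)
y = 1.0 + 0.5 * x + u + rng.normal(scale=0.5, size=cluster.size)
x[rng.random(cluster.size) < 0.10] = np.nan           # 10% missing in x

df = pd.DataFrame({"x": x, "y": y})
# Cluster-dummy adaptation: append one-hot cluster indicators so the
# tree-based imputation model can condition on cluster membership.
dummies = pd.get_dummies(cluster, prefix="cl", dtype=float)
X = pd.concat([df, dummies], axis=1)

imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5, random_state=0,
)
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
print(X_imputed["x"].isna().sum())  # 0: all missing values filled
```

In the R workflow the same idea amounts to adding the cluster indicator as a factor (or its dummies) to the imputation model's predictor matrix before running missRanger or mixgb.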
Related papers
- Evaluating tree-based imputation methods as an alternative to MICE PMM
for drawing inference in empirical studies [0.5892638927736115]
Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures.
The prevailing method of Multiple Imputation by Chained Equations with Predictive Mean Matching (PMM) is considered standard in the social science literature.
More recently, tree-based imputation methods have emerged as very competitive alternatives.
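Predictive mean matching, as mentioned above, imputes a missing value by borrowing an observed value from a "donor" whose model prediction is close. A minimal single-draw sketch in Python (the actual mice implementation differs in details such as parameter draws; the function name and donor count are illustrative):

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Single predictive-mean-matching draw: fill missing y by sampling
    an observed value from the k donors with the closest predictions."""
    rng = np.random.default_rng(rng)
    obs = ~np.isnan(y)
    # Simple linear predictor fitted on the observed cases.
    beta = np.polyfit(x[obs], y[obs], deg=1)
    pred = np.polyval(beta, x)
    y_imp = y.copy()
    for i in np.where(~obs)[0]:
        # k observed donors whose predictions are nearest to case i's
        donors = np.argsort(np.abs(pred[obs] - pred[i]))[:k]
        y_imp[i] = rng.choice(y[obs][donors])  # draw an observed value
    return y_imp

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(size=200)
y[rng.random(200) < 0.2] = np.nan
print(np.isnan(pmm_impute(x, y, rng=1)).sum())  # 0
```

Because every imputed value is an actually observed value, PMM preserves the marginal distribution of the variable, which is a key reason for its popularity in the social sciences.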
arXiv Detail & Related papers (2024-01-17T21:28:00Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data; then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Deep Ensembles Meets Quantile Regression: Uncertainty-aware Imputation for Time Series [49.992908221544624]
Time series data often exhibit numerous missing values; filling them in is the time series imputation task.
Previous deep learning methods have been shown to be effective for time series imputation.
We propose a non-generative time series imputation method that produces accurate imputations with inherent uncertainty.
arXiv Detail & Related papers (2023-12-03T05:52:30Z)
- Compound Batch Normalization for Long-tailed Image Classification [77.42829178064807]
We propose a compound batch normalization method based on a Gaussian mixture.
It can model the feature space more comprehensively and reduce the dominance of head classes.
The proposed method outperforms existing methods on long-tailed image classification.
arXiv Detail & Related papers (2022-12-02T07:31:39Z)
- Condensed Gradient Boosting [0.0]
We propose the use of multi-output regressors as base models to handle the multi-class problem as a single task.
An extensive comparison with other multi-output gradient boosting methods is carried out in terms of generalization and computational efficiency.
arXiv Detail & Related papers (2022-11-26T15:53:19Z)
- Distributional Adaptive Soft Regression Trees [0.0]
This article proposes a new type of distributional regression tree using a multivariate soft split rule.
One great advantage of the soft split is that smooth high-dimensional functions can be estimated with only one tree.
We show by means of extensive simulation studies that the algorithm has excellent properties and outperforms various benchmark methods.
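The soft split mentioned above replaces a tree's hard left/right decision with a smooth gating function, so the overall prediction is differentiable in the inputs. A minimal sketch of one such split (a sigmoid gate over a linear combination of features; the function name and values are illustrative, not the paper's implementation):

```python
import numpy as np

def soft_split_predict(X, w, b, left_val, right_val, temp=1.0):
    """One soft (sigmoid) split: each observation blends the two child
    predictions by a smooth gating weight instead of a hard decision."""
    gate = 1.0 / (1.0 + np.exp(-(X @ w + b) / temp))  # P(go to right child)
    return (1.0 - gate) * left_val + gate * right_val

# Points far on either side of the split plane get (almost) pure child
# values; points near the plane get a smooth blend of both.
X = np.array([[0.0, 0.0], [2.5, 2.5], [5.0, 5.0]])
pred = soft_split_predict(X, w=np.array([1.0, 1.0]), b=-5.0,
                          left_val=1.0, right_val=3.0)
```

Because each leaf contributes everywhere with a smoothly varying weight, a single tree of such splits can represent smooth high-dimensional functions, which is the advantage the abstract highlights.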
arXiv Detail & Related papers (2022-10-19T08:59:02Z)
- On multivariate randomized classification trees: $l_0$-based sparsity, VC dimension and decomposition methods [0.9346127431927981]
We investigate the nonlinear continuous optimization formulation proposed in Blanquero et al.
We first consider alternative methods to sparsify such trees based on concave approximations of the $l_0$ norm.
We propose a general decomposition scheme and an efficient version of it. Experiments on larger datasets show that the proposed decomposition method is able to significantly reduce the training times without compromising the accuracy.
arXiv Detail & Related papers (2021-12-09T22:49:08Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
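The per-sample importance weighting described above can be sketched as a softmax over scaled per-sample losses, so harder samples pull the gradient more. This is an illustrative toy (plain gradient descent on linear regression rather than momentum SGD on a deep network; function and variable names are not from the paper):

```python
import numpy as np

def absgd_batch_weights(losses, lam=10.0):
    """Softmax importance weights over scaled per-sample losses:
    higher-loss samples receive larger weight within the mini-batch."""
    z = np.asarray(losses, dtype=float) / lam
    z -= z.max()                           # numerical stability
    w = np.exp(z)
    return w / w.sum()

# Toy illustration: importance-weighted gradient steps on linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=32)

theta = np.zeros(3)
for _ in range(1000):
    resid = X @ theta - y
    sw = absgd_batch_weights(resid ** 2)   # per-sample squared-error losses
    grad = 2 * (X.T * sw) @ resid          # importance-weighted batch gradient
    theta -= 0.05 * grad
```

The temperature `lam` interpolates between uniform averaging (large `lam`) and focusing almost entirely on the hardest sample (small `lam`), which is how such a scheme can counter class imbalance.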
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
- Handling missing data in model-based clustering [0.0]
We propose two methods to fit Gaussian mixtures in the presence of missing data.
Both methods use a variant of the Monte Carlo Expectation-Maximisation algorithm for data augmentation.
We show that the proposed methods outperform the multiple imputation approach, both in terms of clusters identification and density estimation.
arXiv Detail & Related papers (2020-06-04T15:36:31Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
- Adaptive Correlated Monte Carlo for Contextual Categorical Sequence Generation [77.7420231319632]
We adapt contextual generation of categorical sequences to a policy gradient estimator, which evaluates a set of correlated Monte Carlo (MC) rollouts for variance control.
We also demonstrate the use of correlated MC rollouts for binary-tree softmax models, which reduce the high generation cost in large vocabulary scenarios.
arXiv Detail & Related papers (2019-12-31T03:01:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.