When to Impute? Imputation before and during cross-validation
- URL: http://arxiv.org/abs/2010.00718v1
- Date: Thu, 1 Oct 2020 23:04:16 GMT
- Title: When to Impute? Imputation before and during cross-validation
- Authors: Byron C. Jaeger, Nicholas J. Tierney, Noah R. Simon
- Abstract summary: Cross-validation (CV) is a technique used to estimate generalization error for prediction models.
It has been recommended that the entire sequence of steps be carried out during each replicate of CV to mimic the application of the entire pipeline to an external testing set.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-validation (CV) is a technique used to estimate generalization error
for prediction models. For pipeline modeling algorithms (i.e. modeling
procedures with multiple steps), it has been recommended that the entire sequence of
steps be carried out during each replicate of CV to mimic the application of
the entire pipeline to an external testing set. While theoretically sound,
following this recommendation can lead to high computational costs when a
pipeline modeling algorithm includes computationally expensive operations, e.g.
imputation of missing values. There is a general belief that unsupervised
variable selection (i.e. ignoring the outcome) can be applied before conducting
CV without incurring bias, but there is less consensus for unsupervised
imputation of missing values. We empirically assessed whether conducting
unsupervised imputation prior to CV would result in biased estimates of
generalization error or result in poorly selected tuning parameters and thus
degrade the external performance of downstream models. Results show that,
despite an optimistic bias, imputation before CV has lower variance than
imputation during each replicate of CV, yielding a lower overall root mean
squared error for estimation of the true external R-squared; moreover, models
tuned using CV with imputation before versus during each replicate perform
almost identically. In conclusion, unsupervised imputation before CV
appears valid in certain settings and may be a helpful strategy that enables
analysts to use more flexible imputation techniques without incurring high
computational costs.
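
The contrast described in the abstract is easy to make concrete. Below is a minimal sketch of the two strategies, under the assumption of a scikit-learn workflow with a mean imputer and a ridge model (these choices are illustrative and not from the paper, whose experiments used other imputation techniques):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X[rng.random(X.shape) < 0.1] = np.nan  # introduce ~10% missingness

# Strategy 1: unsupervised imputation BEFORE CV. The imputer is fit once on
# all of X; the outcome y is never used, but held-out folds still leak into
# the imputation step.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
scores_before = cross_val_score(Ridge(), X_imputed, y, cv=5, scoring="r2")

# Strategy 2: imputation DURING CV. The imputer is refit inside each
# training fold, mimicking application of the full pipeline to external data.
pipe = make_pipeline(SimpleImputer(strategy="mean"), Ridge())
scores_during = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"impute before CV: mean R^2 = {scores_before.mean():.3f}")
print(f"impute during CV: mean R^2 = {scores_during.mean():.3f}")
```

The first strategy is the source of the optimistic bias discussed above, since the CV folds share information through the imputation step; the second is the theoretically sound but computationally costlier approach.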
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We derive the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high-dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - Iterative Approximate Cross-Validation [13.084578404699174]
Cross-validation (CV) is one of the most popular tools for assessing and selecting predictive models.
In this paper, we propose a new paradigm to efficiently approximate CV when the empirical risk minimization (ERM) problem is solved via an iterative first-order algorithm.
Our new method extends existing guarantees for CV approximation to hold along the whole trajectory of the algorithm, including at convergence.
arXiv Detail & Related papers (2023-03-05T17:56:08Z) - Toward Theoretical Guidance for Two Common Questions in Practical
Cross-Validation based Hyperparameter Selection [72.76113104079678]
We give the first theoretical treatment of two common questions in cross-validation-based hyperparameter selection.
We show that these generalized procedures can, respectively, perform at least as well as always retraining or never retraining.
arXiv Detail & Related papers (2023-01-12T16:37:12Z) - Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximately valid population coverage and near-optimal efficiency within the class.
Experiments show that our algorithm is able to learn valid prediction sets and improve the efficiency significantly.
arXiv Detail & Related papers (2022-02-22T18:37:23Z) - Confidence intervals for the Cox model test error from cross-validation [91.3755431537592]
Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model.
Standard confidence intervals for test error using estimates from CV may have coverage below nominal levels.
One way to address this issue is to estimate the mean squared error of the prediction error estimate using nested CV.
arXiv Detail & Related papers (2022-01-26T06:40:43Z) - Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression [38.18195443944592]
We show that in the case of ridge regression, the CV loss may fail to be quasiconvex and may have multiple local optima.
More generally, we show that quasiconvexity status is independent of many properties of the observed data.
arXiv Detail & Related papers (2021-07-19T23:22:24Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train models that perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models [6.824747267214373]
We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models.
We provide theoretical results and demonstrate its efficacy on publicly available data and in simulations.
arXiv Detail & Related papers (2020-11-29T00:00:20Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - Approximate Cross-validation: Guarantees for Model Assessment and Selection [18.77512692975483]
Cross-validation (CV) is a popular approach for assessing and selecting predictive models.
Recent work in empirical risk minimization approximates the expensive refitting with a single Newton step warm-started from the full training set (a classical closed-form instance of this idea for ridge regression is sketched after this list).
arXiv Detail & Related papers (2020-03-02T00:30:00Z)
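
Several entries above concern cheap approximations to CV refitting. As a concrete companion, here is a minimal sketch of the classical closed-form leave-one-out identity for ridge regression, which recovers exact LOO residuals from a single fit via the hat-matrix diagonal. This is my own illustration of the general idea, not any of the papers' algorithms; the function name `ridge_loo_rmse` is hypothetical.

```python
import numpy as np

def ridge_loo_rmse(X: np.ndarray, y: np.ndarray, lam: float) -> float:
    """Exact leave-one-out RMSE for ridge regression without n refits.

    Uses e_loo_i = e_i / (1 - H_ii), where H = X (X'X + lam I)^{-1} X';
    exact for ridge (no intercept) via the Sherman-Morrison identity.
    """
    n, p = X.shape
    G = X.T @ X + lam * np.eye(p)        # regularized Gram matrix
    beta = np.linalg.solve(G, X.T @ y)   # full-data ridge coefficients
    # Diagonal of the hat matrix, diag(X G^{-1} X'), without forming H.
    H_diag = np.einsum("ij,ji->i", X, np.linalg.solve(G, X.T))
    residuals = y - X @ beta
    loo_residuals = residuals / (1.0 - H_diag)  # leave-one-out residuals
    return float(np.sqrt(np.mean(loo_residuals ** 2)))

# Usage on synthetic data:
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(100)
print(ridge_loo_rmse(X, y, lam=1.0))
```

For quadratic losses a single Newton step from the full-data solution is exact, which is why this identity can be viewed as the simplest special case of the Newton-based approximate CV described above.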
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.