Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data
- URL: http://arxiv.org/abs/2408.03138v1
- Date: Tue, 6 Aug 2024 12:28:16 GMT
- Title: Predictive Performance Test based on the Exhaustive Nested Cross-Validation for High-dimensional data
- Authors: Iris Ivy Gauran, Hernando Ombao, Zhaoxia Yu,
- Abstract summary: Cross-validation is used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model.
The K-fold cross-validation is a popular CV method but its limitation is that the risk estimates are highly dependent on the partitioning of the data.
This study presents an alternative novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation.
- Score: 7.62566998854384
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is crucial to assess the predictive performance of a model in order to establish its practicality and relevance in real-world scenarios, particularly for high-dimensional data analysis. Among data splitting or resampling methods, cross-validation (CV) is extensively used for several tasks such as estimating the prediction error, tuning the regularization parameter, and selecting the most suitable predictive model among competing alternatives. The K-fold cross-validation is a popular CV method but its limitation is that the risk estimates are highly dependent on the partitioning of the data (for training and testing). Here, the issues regarding the reproducibility of the K-fold CV estimator is demonstrated in hypothesis testing wherein different partitions lead to notably disparate conclusions. This study presents an alternative novel predictive performance test and valid confidence intervals based on exhaustive nested cross-validation for determining the difference in prediction error between two model-fitting algorithms. A naive implementation of the exhaustive nested cross-validation is computationally costly. Here, we address concerns regarding computational complexity by devising a computationally tractable closed-form expression for the proposed cross-validation estimator using ridge regularization. Our study also investigates strategies aimed at enhancing statistical power within high-dimensional scenarios while controlling the Type I error rate. To illustrate the practical utility of our method, we apply it to an RNA sequencing study and demonstrate its effectiveness in the context of biological data analysis.
Related papers
- Ranking and Combining Latent Structured Predictive Scores without Labeled Data [2.5064967708371553]
This paper introduces a novel structured unsupervised ensemble learning model (SUEL)
It exploits the dependency between a set of predictors with continuous predictive scores, rank the predictors without labeled data and combine them to an ensembled score with weights.
The efficacy of the proposed methods is rigorously assessed through both simulation studies and real-world application of risk genes discovery.
arXiv Detail & Related papers (2024-08-14T20:14:42Z) - Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide training examples for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - Bootstrapping the Cross-Validation Estimate [3.5159221757909656]
Cross-validation is a widely used technique for evaluating the performance of prediction models.
It is essential to accurately quantify the uncertainty associated with the estimate.
This paper proposes a fast bootstrap method that quickly estimates the standard error of the cross-validation estimate.
arXiv Detail & Related papers (2023-07-01T07:50:54Z) - Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z) - Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within PAC-Bayes risk certification paradigm.
Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z) - Approximate Cross-validated Mean Estimates for Bayesian Hierarchical Regression Models [6.824747267214373]
We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models.
We provide theoretical results and demonstrate its efficacy on publicly available data and in simulations.
arXiv Detail & Related papers (2020-11-29T00:00:20Z) - Robust Validation: Confident Predictions Even When Distributions Shift [19.327409270934474]
We describe procedures for robust predictive inference, where a model provides uncertainty estimates on its predictions rather than point predictions.
We present a method that produces prediction sets (almost exactly) giving the right coverage level for any test distribution in an $f$-divergence ball around the training population.
An essential component of our methodology is to estimate the amount of expected future data shift and build robustness to it.
arXiv Detail & Related papers (2020-08-10T17:09:16Z) - Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z) - Estimating the Prediction Performance of Spatial Models via Spatial
k-Fold Cross Validation [1.7205106391379026]
In machine learning one often assumes the data are independent when evaluating model performance.
spatial autocorrelation (SAC) causes the standard cross validation (CV) methods to produce optimistically biased prediction performance estimates.
We propose a modified version of the CV method called spatial k-fold cross validation (SKCV) which provides a useful estimate for model prediction performance without optimistic bias due to SAC.
arXiv Detail & Related papers (2020-05-28T19:55:18Z) - Machine learning for causal inference: on the use of cross-fit
estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators for the average causal effect (ACE)
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
arXiv Detail & Related papers (2020-04-21T23:09:55Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.