Extrapolated cross-validation for randomized ensembles
- URL: http://arxiv.org/abs/2302.13511v3
- Date: Fri, 15 Dec 2023 21:13:09 GMT
- Title: Extrapolated cross-validation for randomized ensembles
- Authors: Jin-Hong Du, Pratik Patil, Kathryn Roeder, Arun Kumar Kuchibhotla
- Abstract summary: This paper introduces a cross-validation method, ECV, for tuning the ensemble and subsample sizes in randomized ensembles.
We show that ECV yields $\delta$-optimal ensembles for squared prediction risk.
In comparison to sample-split cross-validation and $K$-fold cross-validation, ECV achieves higher accuracy by avoiding sample splitting.
- Score: 2.3609229325947885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensemble methods such as bagging and random forests are ubiquitous in various
fields, from finance to genomics. Despite their prevalence, the question of the
efficient tuning of ensemble parameters has received relatively little
attention. This paper introduces a cross-validation method, ECV (Extrapolated
Cross-Validation), for tuning the ensemble and subsample sizes in randomized
ensembles. Our method builds on two primary ingredients: initial estimators for
small ensemble sizes using out-of-bag errors and a novel risk extrapolation
technique that leverages the structure of prediction risk decomposition. By
establishing uniform consistency of our risk extrapolation technique over
ensemble and subsample sizes, we show that ECV yields $\delta$-optimal (with
respect to the oracle-tuned risk) ensembles for squared prediction risk. Our
theory accommodates general ensemble predictors, only requires mild moment
assumptions, and allows for high-dimensional regimes where the feature
dimension grows with the sample size. As a practical case study, we employ ECV
to predict surface protein abundances from gene expressions in single-cell
multiomics using random forests. In comparison to sample-split cross-validation
and $K$-fold cross-validation, ECV achieves higher accuracy by avoiding sample
splitting. At the same time, its computational cost is considerably lower owing
to the use of the risk extrapolation technique. Additional numerical results
validate the finite-sample accuracy of ECV for several common ensemble
predictors under a computational constraint on the maximum ensemble size.
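The risk extrapolation idea in the abstract can be illustrated with a minimal sketch. It assumes the standard decomposition of squared prediction risk for an average of $M$ exchangeable randomized predictors, $R_M = R_\infty + (R_1 - R_\infty)/M$, so that risk estimates at $M=1$ and $M=2$ determine the whole curve. The function names and the input values below are hypothetical and are not taken from the paper's code. In ECV the two inputs would come from out-of-bag error estimates; here they are simply given numbers, and tuning over subsample size is omitted.

```python
def extrapolate_risk(r1: float, r2: float, m: int) -> float:
    """Extrapolate the squared risk of an m-ensemble from risks at m=1 and m=2.

    Assumes R_m = R_inf + (R_1 - R_inf) / m, which implies
    R_inf = 2 * R_2 - R_1 and R_1 - R_inf = 2 * (R_1 - R_2).
    """
    if m < 1:
        raise ValueError("ensemble size m must be at least 1")
    r_inf = 2.0 * r2 - r1                # limiting risk of an infinite ensemble
    return r_inf + 2.0 * (r1 - r2) / m   # the 1/m term shrinks as m grows


def smallest_delta_optimal_size(r1: float, r2: float, delta: float, m_max: int) -> int:
    """Smallest m <= m_max whose extrapolated risk is within delta of R_inf.

    Falls back to m_max when no smaller ensemble meets the tolerance.
    """
    r_inf = 2.0 * r2 - r1
    for m in range(1, m_max + 1):
        if extrapolate_risk(r1, r2, m) <= r_inf + delta:
            return m
    return m_max


if __name__ == "__main__":
    # Hypothetical risk estimates for ensembles of size 1 and 2.
    r1_hat, r2_hat = 1.0, 0.75
    for m in (1, 2, 4, 50):
        print(m, extrapolate_risk(r1_hat, r2_hat, m))
    print("chosen size:", smallest_delta_optimal_size(r1_hat, r2_hat, 0.06, 100))
```

Because only ensembles of size one and two need to be evaluated, the cost of estimating the full risk curve does not grow with the largest ensemble size considered, which is consistent with the lower computational cost the abstract mentions.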
Related papers
- Semiparametric conformal prediction [79.6147286161434]
Risk-sensitive applications require well-calibrated prediction sets over multiple, potentially correlated target variables.
We treat the scores as random vectors and aim to construct the prediction set accounting for their joint correlation structure.
We report desired coverage and competitive efficiency on a range of real-world regression problems.
arXiv Detail & Related papers (2024-11-04T14:29:02Z) - Precise Asymptotics of Bagging Regularized M-estimators [5.165142221427928]
We characterize the squared prediction risk of ensemble estimators obtained through subagging (subsample bootstrap aggregating) regularized M-estimators.
Key to our analysis is a new result on the joint behavior of correlations between the estimator and residual errors on overlapping subsamples.
Joint optimization of subsample size, ensemble size, and regularization can significantly outperform regularizer optimization alone on the full data.
arXiv Detail & Related papers (2024-09-23T17:48:28Z) - Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data [1.194799054956877]
Two key tasks in high-dimensional regularized regression are tuning the regularization strength for accurate predictions and estimating the out-of-sample risk.
We introduce a new framework, ROTI-GCV, for reliably performing cross-validation under challenging conditions.
arXiv Detail & Related papers (2024-06-17T15:50:00Z) - Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z) - Asymptotically free sketched ridge ensembles: Risks, cross-validation, and tuning [5.293069542318491]
We employ random matrix theory to establish consistency of generalized cross validation (GCV) for estimating prediction risks of sketched ridge regression ensembles.
For squared prediction risk, we provide a decomposition into an unsketched equivalent implicit ridge bias and a sketching-based variance, and prove that the risk can be globally optimized by tuning only the sketch size in infinite ensembles.
We also propose an "ensemble trick" whereby the risk for unsketched ridge regression can be efficiently estimated via GCV using small sketched ridge ensembles.
arXiv Detail & Related papers (2023-10-06T16:27:43Z) - Subsample Ridge Ensembles: Equivalences and Generalized Cross-Validation [4.87717454493713]
We study subsampling-based ridge ensembles in the proportional asymptotics regime.
We prove that the risk of the optimal full ridgeless ensemble (fitted on all possible subsamples) matches that of the optimal ridge predictor.
arXiv Detail & Related papers (2023-04-25T17:43:27Z) - Bagging in overparameterized learning: Risk characterization and risk monotonization [2.6534407766508177]
We study the prediction risk of variants of bagged predictors under the proportional asymptotics regime.
Specifically, we propose a general strategy to analyze the prediction risk under squared error loss of bagged predictors.
arXiv Detail & Related papers (2022-10-20T17:45:58Z) - Mitigating multiple descents: A model-agnostic framework for risk monotonization [84.6382406922369]
We develop a general framework for risk monotonization based on cross-validation.
We propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting.
arXiv Detail & Related papers (2022-05-25T17:41:40Z) - Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm.
Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
arXiv Detail & Related papers (2022-01-26T19:59:14Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)