RandALO: Out-of-sample risk estimation in no time flat
- URL: http://arxiv.org/abs/2409.09781v1
- Date: Sun, 15 Sep 2024 16:10:03 GMT
- Title: RandALO: Out-of-sample risk estimation in no time flat
- Authors: Parth T. Nobel, Daniel LeJeune, Emmanuel J. Candès
- Abstract summary: Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV).
We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV.
- Score: 5.231056284485742
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating out-of-sample risk for models trained on large high-dimensional datasets is an expensive but essential part of the machine learning process, enabling practitioners to optimally tune hyperparameters. Cross-validation (CV) serves as the de facto standard for risk estimation but poorly trades off high bias ($K$-fold CV) for computational cost (leave-one-out CV). We propose a randomized approximate leave-one-out (RandALO) risk estimator that is not only a consistent estimator of risk in high dimensions but also less computationally expensive than $K$-fold CV. We support our claims with extensive simulations on synthetic and real data and provide a user-friendly Python package implementing RandALO available on PyPI as randalo and at https://github.com/cvxgrp/randalo.
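The abstract points to the authors' randalo package for a production implementation. The sketch below is not that package's API, only a conceptual numpy illustration of the two ingredients approximate-leave-one-out estimators combine for ridge regression: the classical leave-one-out correction residual_i / (1 - H_ii) and a Hutchinson-style randomized estimate of the hat-matrix diagonal. The problem sizes, ridge penalty, and probe count are arbitrary choices.

```python
# Conceptual sketch of approximate leave-one-out (ALO) for ridge regression
# with a Hutchinson-style randomized estimate of the hat-matrix diagonal.
# This is NOT the randalo package API; see https://github.com/cvxgrp/randalo
# for the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 500, 200, 1.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) / np.sqrt(p) + rng.standard_normal(n)

# Ridge fit and in-sample residuals.
G = X.T @ X + lam * np.eye(p)
beta = np.linalg.solve(G, X.T @ y)
resid = y - X @ beta

# Exact ALO: leave-one-out residual_i = residual_i / (1 - H_ii),
# where H = X (X'X + lam I)^{-1} X' is the ridge hat matrix.
H_diag_exact = np.einsum("ij,ij->i", X @ np.linalg.inv(G), X)

# Randomized diagonal estimate: diag(H) = E[z * (H z)] for Rademacher z,
# so averaging z * (H z) over a few probes approximates each H_ii at the
# cost of one linear solve per probe instead of forming H.
m = 30  # number of random probes
Z = rng.choice([-1.0, 1.0], size=(n, m))
HZ = X @ np.linalg.solve(G, X.T @ Z)
H_diag_rand = np.mean(Z * HZ, axis=1)

alo_risk_exact = np.mean((resid / (1 - H_diag_exact)) ** 2)
alo_risk_rand = np.mean((resid / (1 - H_diag_rand)) ** 2)
print(f"ALO risk (exact diag): {alo_risk_exact:.4f}")
print(f"ALO risk (randomized): {alo_risk_rand:.4f}")
```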
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide formulas for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory on a variety of high-dimensional datasets.
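As a rough illustration of the setting (not the paper's closed-form risk formulas), the sketch below simulates ridge regression with AR(1)-correlated training rows, an arbitrary correlation model chosen for the example, and contrasts in-sample and out-of-sample risk:

```python
# Minimal Monte Carlo illustration: ridge regression when training rows
# are correlated across samples (AR(1) correlation is an arbitrary choice).
import numpy as np

rng = np.random.default_rng(1)
n, p, lam, rho = 200, 50, 1.0, 0.8

# AR(1) correlation matrix across the n samples.
K = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(K)

beta_star = rng.standard_normal(p) / np.sqrt(p)
in_risks, out_risks = [], []
for _ in range(200):
    X = L @ rng.standard_normal((n, p))        # correlated rows
    y = X @ beta_star + rng.standard_normal(n)
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    in_risks.append(np.mean((y - X @ beta) ** 2))
    x_test = rng.standard_normal(p)            # independent test point
    out_risks.append((x_test @ beta_star - x_test @ beta) ** 2 + 1.0)

print(f"in-sample risk:     {np.mean(in_risks):.3f}")
print(f"out-of-sample risk: {np.mean(out_risks):.3f}")
```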
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - Lost in the Averages: A New Specific Setup to Evaluate Membership Inference Attacks Against Machine Learning Models [6.343040313814916]
Membership Inference Attacks (MIAs) are used to evaluate the propensity of a machine learning (ML) model to memorize an individual record.
We propose a new, specific evaluation setup for MIAs against ML models.
We show that the risk estimates given by the current setup lead to many records being misclassified as low risk.
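For context, the kind of baseline attack such evaluation setups score is the classic loss-threshold MIA (in the style of Yeom et al.): records with unusually low loss under the trained model are guessed to be members. The dataset and model below are arbitrary stand-ins, not the paper's setup.

```python
# Loss-threshold membership inference attack (a classic baseline, not the
# paper's proposed evaluation setup).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_mem, y_mem = X[:1000], y[:1000]      # members: used for training
X_non, y_non = X[1000:], y[1000:]      # non-members: held out

model = LogisticRegression(max_iter=1000).fit(X_mem, y_mem)

def per_record_loss(m, X, y):
    # Cross-entropy loss of each record under the trained model.
    p = m.predict_proba(X)[np.arange(len(y)), y]
    return -np.log(np.clip(p, 1e-12, None))

# Lower loss => higher membership score; AUC of 0.5 means no leakage.
scores = np.concatenate([-per_record_loss(model, X_mem, y_mem),
                         -per_record_loss(model, X_non, y_non)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"attack AUC: {roc_auc_score(labels, scores):.3f}")
```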
arXiv Detail & Related papers (2024-05-24T10:37:38Z) - Sparse PCA with Oracle Property [115.72363972222622]
We propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations.
We prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA.
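The standard semidefinite relaxation this family builds on (d'Aspremont et al.) fits in a few lines of cvxpy. The sketch below uses that classical formulation, not the paper's novel regularizations; the data and penalty rho are arbitrary.

```python
# Standard semidefinite relaxation of sparse PCA (d'Aspremont et al.):
#   maximize <Sigma, X> - rho * sum_ij |X_ij|
#   s.t.     trace(X) = 1,  X PSD.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
d, rho = 10, 0.5
A = rng.standard_normal((100, d))
A[:, :3] += 2 * rng.standard_normal((100, 1))   # plant a sparse direction
Sigma = np.cov(A, rowvar=False)

X = cp.Variable((d, d), symmetric=True)
prob = cp.Problem(
    cp.Maximize(cp.trace(Sigma @ X) - rho * cp.sum(cp.abs(X))),
    [X >> 0, cp.trace(X) == 1],
)
prob.solve(solver=cp.SCS)

# The leading eigenvector of the solution is the sparse PC estimate.
w, V = np.linalg.eigh(X.value)
print(np.round(V[:, -1], 2))
```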
arXiv Detail & Related papers (2023-12-28T02:52:54Z) - Provably Efficient CVaR RL in Low-rank MDPs [58.58570425202862]
We study risk-sensitive Reinforcement Learning (RL) with the Conditional Value at Risk (CVaR) objective in low-rank MDPs.
We propose a novel Upper Confidence Bound (UCB) bonus-driven algorithm to balance the interplay between exploration, exploitation, and representation learning in CVaR RL.
We prove that our algorithm achieves a sample complexity polynomial in $H$, $A$, $d$, and $1/\epsilon$ for learning an $\epsilon$-optimal CVaR policy, where $H$ is the length of each episode, $A$ is the capacity of the action space, and $d$ is the dimension of the representations.
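To make the objective concrete: the conditional value-at-risk at level tau is the mean of the worst tau-fraction of outcomes. The helper below is just this definition, not the paper's UCB algorithm.

```python
# Empirical CVaR: mean of the worst tau-fraction of losses.
import numpy as np

def cvar(samples, tau):
    # VaR_tau is the tau-quantile of the loss; CVaR_tau averages the tail.
    samples = np.sort(samples)
    k = max(1, int(np.ceil(tau * len(samples))))
    return samples[-k:].mean()  # worst tau-fraction (losses: higher = worse)

losses = np.random.default_rng(0).standard_normal(10_000)
print(f"CVaR_0.05 = {cvar(losses, 0.05):.3f}")  # ~2.06 for N(0,1)
```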
arXiv Detail & Related papers (2023-11-20T17:44:40Z) - Empirical Risk Minimization for Losses without Variance [26.30435936379624]
This paper considers an empirical risk minimization problem under heavy-tailed settings, where the data do not have finite variance but only a $p$-th moment with $p \in (1,2)$.
Instead of using an estimation procedure based on truncated observed data, we choose the minimizer by directly minimizing the risk value.
These risk values can be robustly estimated using Catoni's method (Catoni, 2012).
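Catoni's estimator referenced here admits a short implementation: solve sum_i psi(alpha * (x_i - mu)) = 0 for mu, where the influence function psi grows only logarithmically, so heavy-tailed outliers have bounded effect. A sketch, with an arbitrary choice of alpha:

```python
# Catoni's robust mean estimator (Catoni, 2012).
import numpy as np
from scipy.optimize import brentq

def psi(x):
    # Catoni's influence function: sign(x) * log(1 + |x| + x^2/2).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

def catoni_mean(x, alpha=0.1):
    # The root equation is strictly decreasing in mu, so bracketing works.
    f = lambda mu: psi(alpha * (x - mu)).sum()
    return brentq(f, x.min() - 1.0, x.max() + 1.0)

rng = np.random.default_rng(0)
x = rng.standard_t(df=1.5, size=2000)  # infinite variance, finite mean
print(f"sample mean: {x.mean():+.3f}   Catoni: {catoni_mean(x):+.3f}")
```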
arXiv Detail & Related papers (2023-09-07T16:14:00Z) - Robust leave-one-out cross-validation for high-dimensional Bayesian
models [0.0]
Leave-one-out cross-validation (LOO-CV) is a popular method for estimating out-of-sample predictive accuracy.
Here we propose and analyze a novel mixture estimator to compute LOO-CV criteria.
Our method retains the simplicity and computational convenience of classical approaches, while guaranteeing finite variance of the resulting estimators.
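For contrast, the classical importance-sampling LOO estimator whose potentially infinite-variance weights motivate the paper's mixture estimator looks like this (the sketch below is the classical version, not the paper's estimator; the toy model is an arbitrary normal example):

```python
# Classical importance-sampling LOO from posterior draws. Given an (S, N)
# matrix of pointwise log-likelihoods log p(y_i | theta_s):
#   p_loo(y_i) ~ S / sum_s 1/p(y_i | theta_s)   (harmonic mean over draws)
import numpy as np
from scipy.special import logsumexp

def is_loo_elpd(loglik):
    S = loglik.shape[0]
    # log p_loo(y_i) = log S - logsumexp_s(-loglik[s, i])
    return np.log(S) - logsumexp(-loglik, axis=0)

# Toy example: normal model with known variance, posterior draws of the mean.
rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=50)
theta = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=(4000, 1))
loglik = -0.5 * (y - theta) ** 2 - 0.5 * np.log(2 * np.pi)
print(f"IS-LOO elpd: {is_loo_elpd(loglik).sum():.2f}")
```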
arXiv Detail & Related papers (2022-09-19T17:14:52Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
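Two classical sample-based baselines, Chao1 and the GEE estimator of Charikar et al., give a sense of what such a learned estimator competes with; the skewed column below is an arbitrary example.

```python
# Classical NDV estimators from a random sample of a column.
import numpy as np
from collections import Counter

def ndv_estimates(sample, N):
    # f[j] = number of distinct values appearing exactly j times in the sample
    counts = Counter(Counter(sample).values())
    d = sum(counts.values())            # distinct values seen in the sample
    f1, f2 = counts.get(1, 0), counts.get(2, 0)
    chao1 = d + (f1 * f1) / (2 * f2) if f2 > 0 else d
    gee = np.sqrt(N / len(sample)) * f1 + (d - f1)  # Charikar et al.'s GEE
    return chao1, gee

rng = np.random.default_rng(0)
column = rng.zipf(1.5, size=100_000) % 5_000     # skewed synthetic column
sample = rng.choice(column, size=1_000, replace=False)
print("true NDV:", len(set(column)), " estimates (Chao1, GEE):",
      ndv_estimates(sample, len(column)))
```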
arXiv Detail & Related papers (2022-02-06T15:42:04Z) - Dimensionality reduction, regularization, and generalization in
overparameterized regressions [8.615625517708324]
We study PCA-OLS, also known as principal component regression, in which dimensionality reduction precedes ordinary least squares.
We show that dimensionality reduction improves robustness, while OLS is arbitrarily susceptible to adversarial attacks.
We find that methods in which the projection depends on the training data can outperform methods where the projections are chosen independently of the training data.
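Principal component regression is a one-liner in scikit-learn; the sketch below, with arbitrary sizes, is also an instance of a projection that depends on the training data, as discussed above.

```python
# Principal component regression (PCA-OLS): project onto the top-k
# principal components, then run OLS in the reduced space.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 200                       # overparameterized: p > n
X = rng.standard_normal((n, p))
y = X[:, :5].sum(axis=1) + 0.1 * rng.standard_normal(n)

pcr = make_pipeline(PCA(n_components=20), LinearRegression())
scores = cross_val_score(pcr, X, y, scoring="neg_mean_squared_error", cv=5)
print(f"PCR 5-fold CV MSE: {-scores.mean():.3f}")
```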
arXiv Detail & Related papers (2020-11-23T15:38:50Z) - Optimal Best-Arm Identification Methods for Tail-Risk Measures [9.128264779870538]
Conditional value-at-risk (CVaR) and value-at-risk (VaR) are popular tail-risk measures in finance and insurance industries.
We consider the problem of identifying, from amongst finitely many distributions, the one with the smallest CVaR, VaR, or sum of CVaR and mean.
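A naive fixed-budget version of this selection problem samples every candidate equally and picks the smallest empirical CVaR; the paper develops far more sample-efficient confidence-bound methods. The candidate distributions below are arbitrary.

```python
# Naive fixed-budget selection of the arm with smallest empirical CVaR.
import numpy as np

def cvar(samples, tau=0.1):
    s = np.sort(samples)
    k = max(1, int(np.ceil(tau * len(s))))
    return s[-k:].mean()  # mean of the worst tau-fraction of losses

rng = np.random.default_rng(0)
arms = [lambda n: rng.normal(0, 1, n),        # light tail
        lambda n: rng.standard_t(2.5, n),     # heavy tail, same median
        lambda n: rng.normal(0.2, 0.5, n)]    # shifted but narrow
est = [cvar(draw(5_000)) for draw in arms]
print("empirical CVaRs:", np.round(est, 3), " best arm:", int(np.argmin(est)))
```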
arXiv Detail & Related papers (2020-08-17T20:23:24Z) - Sharp Statistical Guarantees for Adversarially Robust Gaussian
Classification [54.22421582955454]
We provide the first optimal minimax guarantees for the excess risk of adversarially robust classification.
Results are stated in terms of the Adversarial Signal-to-Noise Ratio (AdvSNR), which generalizes a similar notion for standard linear classification to the adversarial setting.
arXiv Detail & Related papers (2020-06-29T21:06:52Z) - Nonparametric Estimation in the Dynamic Bradley-Terry Model [69.70604365861121]
We develop a novel estimator that relies on kernel smoothing to pre-process the pairwise comparisons over time.
We derive time-varying oracle bounds for both the estimation error and the excess risk in the model-agnostic setting.
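The kernel-smoothing idea admits a compact sketch: weight each pairwise comparison by a Gaussian kernel in time and fit Bradley-Terry scores by weighted logistic regression. The skill curves and bandwidth below are arbitrary illustrative choices, not the paper's estimator.

```python
# Kernel-weighted Bradley-Terry fit at a query time t0.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_items, n_games = 4, 2000
times = rng.uniform(0, 1, n_games)
# True skills drift over time; item 0 improves, item 3 declines.
skill = lambda t: np.array([2 * t, 0.5, 0.5, 1 - t])

rows, wins = [], []
for t in times:
    i, j = rng.choice(n_items, size=2, replace=False)
    s = skill(t)
    p_ij = 1 / (1 + np.exp(-(s[i] - s[j])))        # Bradley-Terry win prob
    x = np.zeros(n_items); x[i], x[j] = 1.0, -1.0  # design row: e_i - e_j
    rows.append(x); wins.append(rng.random() < p_ij)

X, y = np.array(rows), np.array(wins, dtype=int)
t0, h = 0.9, 0.1
w = np.exp(-0.5 * ((times - t0) / h) ** 2)         # Gaussian kernel weights
clf = LogisticRegression(fit_intercept=False, C=1e3)
clf.fit(X, y, sample_weight=w)
print("estimated skills at t=0.9:", np.round(clf.coef_[0], 2))
```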
arXiv Detail & Related papers (2020-02-28T21:52:49Z)