Fast cross-validation for multi-penalty ridge regression
- URL: http://arxiv.org/abs/2005.09301v2
- Date: Thu, 1 Apr 2021 07:52:28 GMT
- Title: Fast cross-validation for multi-penalty ridge regression
- Authors: Mark A. van de Wiel, Mirrelijn M. van Nee, Armin Rauschenberger
- Abstract summary: Ridge regression is a simple model for high-dimensional data.
Our main contribution is a computationally very efficient formula for the multi-penalty, sample-weighted hat-matrix.
Extensions to paired and preferential data types are included and illustrated on several cancer genomics survival prediction problems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional prediction with multiple data types needs to account for
potentially strong differences in predictive signal. Ridge regression is a
simple model for high-dimensional data that has challenged the predictive
performance of many more complex models and learners, and that allows inclusion
of data type specific penalties. The largest challenge for multi-penalty ridge
is to optimize these penalties efficiently in a cross-validation (CV) setting,
in particular for GLM and Cox ridge regression, which require an additional
estimation loop by iterative weighted least squares (IWLS). Our main
contribution is a computationally very efficient formula for the multi-penalty,
sample-weighted hat-matrix, as used in the IWLS algorithm. As a result, nearly
all computations are in low-dimensional space, yielding a speed-up of several
orders of magnitude. We developed a flexible framework that facilitates
multiple types of response, unpenalized covariates, several performance
criteria and repeated CV. Extensions to paired and preferential data types are
included and illustrated on several cancer genomics survival prediction
problems. Moreover, we present similar computational shortcuts for maximum
marginal likelihood and Bayesian probit regression. The corresponding
R-package, multiridge, serves as a versatile standalone tool, but also as a
fast benchmark for other more complex models and multi-view learners.
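The low-dimensional shortcut described in the abstract can be sketched as follows. This is a hedged illustration in Python/NumPy, not the API of the multiridge R package: assuming the multi-penalty problem min ||sqrt(W)(z - X beta)||^2 + sum_b lam_b ||beta_b||^2 over feature blocks X_b, the sample-weighted hat matrix can be written as H = (K W + I_n)^{-1} K W with K = sum_b X_b X_b^T / lam_b, so for n << p only n x n matrices are ever formed. The function name and block/penalty arguments below are illustrative.

```python
import numpy as np

def multipenalty_hat(X_blocks, lambdas, w=None):
    """Illustrative sketch (not the multiridge API): the sample-weighted
    multi-penalty ridge hat matrix, built from n x n quantities only.

    Solves min ||sqrt(W)(z - X beta)||^2 + sum_b lam_b ||beta_b||^2
    implicitly: fitted values are H z with H = (K W + I_n)^{-1} K W,
    where K = sum_b X_b X_b^T / lam_b. No p x p matrix is formed.
    """
    n = X_blocks[0].shape[0]
    # n x n kernel matrix aggregating all penalised blocks
    K = sum(Xb @ Xb.T / lam for Xb, lam in zip(X_blocks, lambdas))
    # W holds IWLS-style sample weights (identity for plain least squares)
    W = np.eye(n) if w is None else np.diag(w)
    return np.linalg.solve(K @ W + np.eye(n), K @ W)

# check against the explicit p-dimensional weighted ridge solution
rng = np.random.default_rng(0)
X1, X2 = rng.standard_normal((15, 400)), rng.standard_normal((15, 250))
z = rng.standard_normal(15)
w = rng.uniform(0.5, 2.0, size=15)          # IWLS-style sample weights
H = multipenalty_hat([X1, X2], [2.0, 5.0], w)

X = np.hstack([X1, X2])                      # 15 x 650
D = np.diag([2.0] * 400 + [5.0] * 250)       # block-wise penalties
beta = np.linalg.solve(X.T @ (w[:, None] * X) + D, X.T @ (w * z))
assert np.allclose(H @ z, X @ beta)          # same fits, without any p x p solve
```

The equivalence follows from the push-through identity X(X^T W X + D)^{-1} X^T W = (K W + I)^{-1} K W after rescaling each block by 1/sqrt(lam_b); within an IWLS loop, only K W + I_n (an n x n matrix) would need refactoring as the weights change.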
Related papers
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Multiple Augmented Reduced Rank Regression for Pan-Cancer Analysis
We propose multiple augmented reduced rank regression (maRRR), a flexible matrix regression and factorization method.
We consider a structured nuclear norm objective that is motivated by random matrix theory.
We apply maRRR to gene expression data from multiple cancer types (i.e., pan-cancer) from TCGA.
arXiv Detail & Related papers (2023-08-30T21:40:58Z)
- Probabilistic Unrolling: Scalable, Inverse-Free Maximum Likelihood Estimation for Latent Gaussian Models
We introduce probabilistic unrolling, a method that combines Monte Carlo sampling with iterative linear solvers to circumvent matrix inversions.
Our theoretical analyses reveal that unrolling and backpropagation through the iterations of the solver can accelerate gradient estimation for maximum likelihood estimation.
In experiments on simulated and real data, we demonstrate that probabilistic unrolling learns latent Gaussian models up to an order of magnitude faster than gradient EM, with minimal losses in model performance.
arXiv Detail & Related papers (2023-06-05T21:08:34Z)
- Scalable Estimation for Structured Additive Distributional Regression
We propose a novel backfitting algorithm, based on the ideas of gradient descent, that can deal with virtually any amount of data on a conventional laptop.
Performance is evaluated using an extensive simulation study and an exceptionally challenging and unique example of lightning count prediction over Austria.
arXiv Detail & Related papers (2023-01-13T14:59:42Z)
- Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are made via plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z)
- Consensual Aggregation on Random Projected High-dimensional Features for Regression
We present a study of a kernel-based consensual aggregation on randomly projected high-dimensional features of predictions for regression.
We numerically illustrate that the aggregation scheme maintains its performance on very large and highly correlated features.
The efficiency of the proposed method is illustrated through several experiments evaluated on different types of synthetic and real datasets.
arXiv Detail & Related papers (2022-04-06T06:35:47Z)
- Parallel integrative learning for large-scale multi-response regression with incomplete outcomes
In the era of big data, the coexistence of incomplete outcomes, large number of responses, and high dimensionality in predictors poses unprecedented challenges in estimation, prediction, and computation.
We propose a scalable and computationally efficient procedure, called PEER, for large-scale multi-response regression with incomplete outcomes.
Under some mild regularity conditions, we show that PEER enjoys nice sampling properties including consistency in estimation, prediction, and variable selection.
arXiv Detail & Related papers (2021-04-11T19:01:24Z)
- A Hypergradient Approach to Robust Regression without Correspondence
We consider a variant of the regression problem in which the correspondence between input and output data is not available.
Most existing methods are only applicable when the sample size is small.
We propose a new computational framework -- ROBOT -- for the shuffled regression problem.
arXiv Detail & Related papers (2020-11-30T21:47:38Z)
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays
Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)
- Two-step penalised logistic regression for multi-omic data with an application to cardiometabolic syndrome
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.