Comparing Classes of Estimators: When does Gradient Descent Beat Ridge
Regression in Linear Models?
- URL: http://arxiv.org/abs/2108.11872v1
- Date: Thu, 26 Aug 2021 16:01:37 GMT
- Title: Comparing Classes of Estimators: When does Gradient Descent Beat Ridge
Regression in Linear Models?
- Authors: Dominic Richards, Edgar Dobriban, Patrick Rebeschini
- Abstract summary: We compare classes of estimators via the relative performance of the \emph{best method in the class}.
This allows us to rigorously quantify the tuning sensitivity of learning algorithms.
- Score: 46.01087792062936
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Modern methods for learning from data depend on many tuning parameters, such
as the stepsize for optimization methods, and the regularization strength for
regularized learning methods. Since performance can depend strongly on these
parameters, it is important to develop comparisons between \emph{classes of
methods}, not just for particularly tuned ones. Here, we aim to compare classes
of estimators via the relative performance of the \emph{best method in
the class}. This allows us to rigorously quantify the tuning sensitivity of
learning algorithms. As an illustration, we investigate the statistical
estimation performance of ridge regression with a uniform grid of
regularization parameters, and of gradient descent iterates with a fixed
stepsize, in the standard linear model with a random isotropic ground truth
parameter.
(1) For orthogonal designs, we find the \emph{exact minimax optimal classes
of estimators}, showing they are equal to gradient descent with a polynomially
decaying learning rate. We find the exact suboptimalities of ridge regression
and gradient descent with a fixed stepsize, showing that they decay as either
$1/k$ or $1/k^2$ for specific ranges of the number of estimators $k$ in each class.
(2) For general designs with a large number of non-zero eigenvalues, we find
that gradient descent outperforms ridge regression when the eigenvalues decay
slowly, as a power law with exponent less than unity. If instead the
eigenvalues decay quickly, as a power law with exponent greater than unity or
exponentially, we find that ridge regression outperforms gradient descent.
Our results highlight the importance of tuning parameters. In particular,
while optimally tuned ridge regression is the best estimator in our case, it
can be outperformed by gradient descent when both are restricted to being tuned
over a finite regularization grid.
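To make the setup concrete, below is a minimal numerical sketch (not the authors' code) of the comparison described in the abstract: ridge regression tuned over a uniform grid of $k$ regularization parameters versus $k$ gradient descent iterates with a fixed stepsize, in a linear model with a random isotropic ground-truth parameter. The problem sizes, grid range, and stepsize are illustrative assumptions, not values from the paper.

```python
# Sketch: compare the best ridge estimator on a uniform regularization grid
# against the best gradient descent iterate with a fixed stepsize.
# All constants below (n, d, k, grid range, stepsize) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 50, 20                              # samples, dimension, estimators per class
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d) / np.sqrt(d)    # random isotropic ground truth
y = X @ beta_star + rng.standard_normal(n)         # standard linear model with noise

XtX, Xty = X.T @ X / n, X.T @ y / n                # normalized sufficient statistics

# Class 1: ridge regression over a uniform grid of regularization parameters.
ridge_errs = [
    np.sum((np.linalg.solve(XtX + lam * np.eye(d), Xty) - beta_star) ** 2)
    for lam in np.linspace(0.01, 2.0, k)
]

# Class 2: gradient descent iterates on the least-squares loss, fixed stepsize.
eta = 1.0 / np.linalg.eigvalsh(XtX).max()          # stepsize at most 1/L for stability
beta_gd, gd_errs = np.zeros(d), []
for _ in range(k):
    beta_gd -= eta * (XtX @ beta_gd - Xty)         # gradient step on (1/2n)||y - X b||^2
    gd_errs.append(np.sum((beta_gd - beta_star) ** 2))

# The paper compares classes via the best method in each class; here that is
# the smallest estimation error over the grid / over the iterates on one draw.
print(f"best ridge error on grid: {min(ridge_errs):.4f}")
print(f"best GD iterate error   : {min(gd_errs):.4f}")
```

The printed quantities correspond to a single random draw; the paper's results concern the expected estimation error over the randomness of the ground truth and the noise, and how the best achievable error in each class depends on how quickly the design's eigenvalues decay.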
Related papers
- Stagewise Boosting Distributional Regression [0.0]
We propose a stagewise boosting-type algorithm for distributional regression.
We extend it with a novel regularization method, correlation filtering, to provide additional stability.
Besides the advantage of processing large datasets, the nature of the approximations can lead to better results.
arXiv Detail & Related papers (2024-05-28T15:40:39Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-free) gradient based adaption.
The main idea of the method is to adapt the learning rate $\alpha$ based on situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely Variance Reduced Langevin Dynamics and Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed in the gradients over the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of accuracy and converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability at the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Self-Tuning Stochastic Optimization with Curvature-Aware Gradient
Filtering [53.523517926927894]
We explore the use of exact per-sample Hessian-vector products and gradients to construct self-tuning optimizers.
We prove that our model-based procedure converges in the noisy quadratic setting.
This is an interesting step toward constructing self-tuning optimizers.
arXiv Detail & Related papers (2020-11-09T22:07:30Z) - Reparametrizing gradient descent [0.0]
We propose an optimization algorithm which we call norm-adapted gradient descent.
Our algorithm can also be compared to quasi-Newton methods, but we seek roots rather than stationary points.
arXiv Detail & Related papers (2020-10-09T20:22:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.