Comparing regularisation paths of (conjugate) gradient estimators in ridge regression
- URL: http://arxiv.org/abs/2503.05542v2
- Date: Mon, 10 Mar 2025 07:25:09 GMT
- Title: Comparing regularisation paths of (conjugate) gradient estimators in ridge regression
- Authors: Laura Hucker, Markus Reiß, Thomas Stark
- Abstract summary: We consider gradient descent, gradient flow and conjugate gradients as iterative algorithms for minimizing a penalized ridge criterion in linear regression. In particular, the oracle conjugate gradient iterate shares the optimality properties of the gradient flow and ridge regression oracles up to a constant factor.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider standard gradient descent, gradient flow and conjugate gradients as iterative algorithms for minimizing a penalized ridge criterion in linear regression. While it is well known that conjugate gradients exhibit fast numerical convergence, the statistical properties of their iterates are more difficult to assess due to inherent nonlinearities and dependencies. On the other hand, standard gradient flow is a linear method with well known regularizing properties when stopped early. By an explicit non-standard error decomposition we are able to bound the prediction error for conjugate gradient iterates by a corresponding prediction error of gradient flow at transformed iteration indices. This way, the risk along the entire regularisation path of conjugate gradient iterations can be compared to that for regularisation paths of standard linear methods like gradient flow and ridge regression. In particular, the oracle conjugate gradient iterate shares the optimality properties of the gradient flow and ridge regression oracles up to a constant factor. Numerical examples show the similarity of the regularisation paths in practice.
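To make the comparison concrete, here is a small numpy sketch (my own construction under simplifying assumptions, not the authors' code) that traces all three regularisation paths on synthetic data and reports the oracle, i.e. best-along-path, prediction error of each. For simplicity the conjugate gradient iterates are computed for the unpenalized least-squares criterion, the λ→0 case of the penalized ridge criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 50, 1.0
X = rng.standard_normal((n, p)) / np.sqrt(n)
beta_star = rng.standard_normal(p)
y = X @ beta_star + sigma * rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
b = U.T @ y

def pred_error(beta):
    # squared prediction error ||X (beta - beta*)||^2
    return np.sum((X @ (beta - beta_star)) ** 2)

# Ridge path: beta(lam) = V diag(s / (s^2 + lam)) U^T y
ridge = [Vt.T @ (s / (s ** 2 + lam) * b) for lam in np.logspace(-4, 2, 50)]
# Gradient flow path: beta(t) = V diag((1 - exp(-s^2 t)) / s) U^T y
flow = [Vt.T @ ((1 - np.exp(-s ** 2 * t)) / s * b) for t in np.logspace(-1, 5, 50)]

def cg_path(A, rhs, m):
    # conjugate gradient iterates for A beta = rhs, started at zero
    beta, r = np.zeros_like(rhs), rhs.copy()
    d, path = r.copy(), []
    for _ in range(m):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        beta = beta + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
        path.append(beta.copy())
    return path

cg = cg_path(X.T @ X, X.T @ y, 20)

for name, path in [("ridge", ridge), ("gradient flow", flow), ("conjugate gradients", cg)]:
    print("oracle prediction error, %-19s: %.4f" % (name, min(map(pred_error, path))))
```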
Related papers
- Gradient Equilibrium in Online Learning: Theory and Applications [56.02856551198923]
Gradient equilibrium is achieved by standard online learning methods. Gradient equilibrium translates into an interpretable and meaningful property in online prediction problems. We show that the gradient equilibrium framework can be used to develop a debiasing scheme for black-box predictions.
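As a toy illustration of the property itself (my own construction, not the paper's scheme), the sketch below runs online gradient descent with a decaying step size on an additive bias correction for black-box predictions under squared loss; gradient equilibrium means the time-averaged gradient tends to zero, which here amounts to the corrected predictions being unbiased on average.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 10_000
y = rng.standard_normal(T) + 2.0     # targets with mean 2
f = rng.standard_normal(T)           # biased black-box predictions (mean 0)

b, grads = 0.0, []
for t in range(T):
    pred = f[t] + b                  # additively corrected prediction
    g = pred - y[t]                  # gradient of 0.5 * (pred - y)^2 in b
    grads.append(g)
    b -= g / np.sqrt(t + 1)          # online gradient step, decaying rate

print("average gradient along the run:", np.mean(grads))  # -> ~0
print("learned bias correction       :", b)                # -> ~2
```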
arXiv Detail & Related papers (2025-01-14T18:59:09Z)
- Limit Theorems for Stochastic Gradient Descent with Infinite Variance [47.87144151929621]
We show that the iterates of the stochastic gradient descent algorithm can be characterized via the stationary distribution of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate Lévy process.
We also explore the applications of these results in linear regression and logistic regression models.
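A hedged toy simulation of that setting (the setup is mine, not the paper's): SGD for a one-dimensional linear regression whose noise has Pareto tails with index α < 2, hence infinite variance, so the stationary fluctuations of the iterates around the optimum are themselves heavy-tailed.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_star, eta, T = 1.0, 0.01, 100_000
alpha = 1.5                                   # tail index < 2: infinite variance
noise = rng.pareto(alpha, T) * rng.choice([-1.0, 1.0], T)

theta, iterates = 0.0, np.empty(T)
for t in range(T):
    x = rng.standard_normal()
    y = theta_star * x + noise[t]
    theta -= eta * (theta * x - y) * x        # SGD step on 0.5 * (y - theta*x)^2
    iterates[t] = theta

fluct = np.abs(iterates[T // 2:] - theta_star)  # fluctuations after burn-in
print("median fluctuation :", np.median(fluct))
print("99.9% quantile     :", np.quantile(fluct, 0.999))  # heavy tail shows here
```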
arXiv Detail & Related papers (2024-10-21T09:39:10Z)
- Fisher-Rao Gradient Flows of Linear Programs and State-Action Natural Policy Gradients [15.218434620361387]
We study another natural gradient method based on the Fisher information matrix of the state-action distributions. We show sublinear convergence for Fisher-Rao gradient flows and natural gradient flows up to an approximation error.
arXiv Detail & Related papers (2024-03-28T14:16:23Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when *done right* -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a *stochastic dual descent* algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
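A minimal sketch of the underlying dual view (assumptions mine; the paper's stochastic dual descent additionally uses random coordinate blocks, momentum and iterate averaging): gradient descent on the dual objective 0.5·aᵀ(K + σ²I)a − aᵀy, whose minimiser a* = (K + σ²I)⁻¹y yields the GP posterior mean.

```python
import numpy as np

rng = np.random.default_rng(3)
n, s2 = 200, 1.0
X = rng.uniform(-3.0, 3.0, n)
y = np.sin(X) + np.sqrt(s2) * rng.standard_normal(n)
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)   # RBF kernel matrix

A = K + s2 * np.eye(n)
eta = 1.0 / np.linalg.eigvalsh(A).max()             # step size 1/L, stable
a = np.zeros(n)
for _ in range(2000):
    a -= eta * (A @ a - y)                          # dual gradient step

exact = np.linalg.solve(A, y)                       # exact dual coefficients
print("relative error vs exact solve:",
      np.linalg.norm(a - exact) / np.linalg.norm(exact))
```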
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- The Implicit Bias of Batch Normalization in Linear Models and Two-layer Linear Convolutional Neural Networks [117.93273337740442]
We show that gradient descent converges to a uniform margin classifier on the training data with an $\exp(-\Omega(\log^2 t))$ convergence rate.
We also show that batch normalization has an implicit bias towards a patch-wise uniform margin.
arXiv Detail & Related papers (2023-06-20T16:58:00Z)
- Gradients should stay on Path: Better Estimators of the Reverse- and Forward KL Divergence for Normalizing Flows [4.830811539001643]
We propose an algorithm to estimate the path-gradient of both the reverse and forward Kullback-Leibler divergence for an arbitrary manifestly invertible normalizing flow.
The resulting path-gradient estimators are straightforward to implement, have lower variance, and lead not only to faster convergence of training but also to better overall approximation results.
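A one-dimensional toy illustration (my construction, not the paper's general estimator) with an affine flow x = μ + σz, q = N(μ, σ²) and target p = N(0, 1): the path gradient drops the zero-mean score term of the standard reparameterised estimator, so it stays unbiased while its variance shrinks, vanishing at the optimum.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sig = 0.05, 1.1                  # flow parameters, close to the optimum
z = rng.standard_normal(100_000)
x = mu + sig * z                     # pushforward samples of the affine flow

# Standard reparameterised gradient of the reverse KL w.r.t. mu reduces to x here
g_standard = x
# Path gradient: (dx/dmu) * d/dx [log q(x) - log p(x)] = -z/sig + x
g_path = -z / sig + x

print("true gradient      :", mu)    # d/dmu KL(q || p) = mu for this pair
print("standard: mean %.4f, var %.4f" % (g_standard.mean(), g_standard.var()))
print("path    : mean %.4f, var %.4f" % (g_path.mean(), g_path.var()))
```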
arXiv Detail & Related papers (2022-07-17T16:27:41Z)
- Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond [33.593203156666746]
We focus on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification.
We give an additional unified explanation for this generalization, which we refer to as realizability and self-boundedness.
In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.
arXiv Detail & Related papers (2022-02-27T19:56:36Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
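A rough sketch of the setting (my construction): constant-stepsize single-sample SGD with tail averaging in an overparameterised linear regression (p > n), compared against the minimum-norm least-squares interpolator; both rely purely on algorithmic rather than explicit regularization.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 100, 400, 0.5
beta_star = np.zeros(p)
beta_star[:10] = 1.0                                 # effectively sparse signal
X = rng.standard_normal((n, p))
y = X @ beta_star + sigma * rng.standard_normal(n)

beta_ols = np.linalg.pinv(X) @ y                     # minimum-norm interpolator

eta, T = 0.5 / p, 20_000                             # constant step size
beta, avg, count = np.zeros(p), np.zeros(p), 0
for t in range(T):
    i = rng.integers(n)
    beta -= eta * (X[i] @ beta - y[i]) * X[i]        # single-sample gradient step
    if t >= T // 2:                                  # tail-average the iterates
        avg += beta
        count += 1
avg /= count

def excess_risk(b):
    # excess risk E[(x'(b - beta*))^2] = ||b - beta*||^2 for x ~ N(0, I_p)
    return np.sum((b - beta_star) ** 2)

print("excess risk, min-norm OLS:", excess_risk(beta_ols))
print("excess risk, averaged SGD:", excess_risk(avg))
```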
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Tighter Bounds on the Log Marginal Likelihood of Gaussian Process Regression Using Conjugate Gradients [19.772149500352945]
We show that approximate maximum likelihood learning of model parameters by maximising our lower bound retains many of the benefits of the sparse variational approach.
In experiments, we show improved predictive performance for a comparable amount of training time relative to other conjugate gradient based approaches.
arXiv Detail & Related papers (2021-02-16T17:54:59Z)
- Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization.
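The paper's backward error analysis shows that discrete gradient descent with step size η implicitly follows a modified loss of the form L(w) + (η/4)‖∇L(w)‖². The toy example below (the quadratic-plus-quartic loss is mine) simply evaluates that implicit penalty along a gradient descent run.

```python
import numpy as np

def loss(w):
    return 0.5 * np.sum(w ** 2) + 0.1 * np.sum(w ** 4)

def grad(w):
    return w + 0.4 * w ** 3

def modified_loss(w, eta):
    # loss plus the implicit gradient penalty (eta / 4) * ||grad L(w)||^2
    g = grad(w)
    return loss(w) + (eta / 4.0) * np.sum(g ** 2)

w, eta = np.array([2.0, -1.5]), 0.1
for step in range(5):
    print("step %d: L = %.5f, modified L = %.5f"
          % (step, loss(w), modified_loss(w, eta)))
    w = w - eta * grad(w)                            # plain gradient descent
```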
arXiv Detail & Related papers (2020-09-23T14:17:53Z)
- Channel-Directed Gradients for Optimization of Convolutional Neural Networks [50.34913837546743]
We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error.
We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental.
arXiv Detail & Related papers (2020-08-25T00:44:09Z)