On the Benefits of Large Learning Rates for Kernel Methods
- URL: http://arxiv.org/abs/2202.13733v1
- Date: Mon, 28 Feb 2022 13:01:04 GMT
- Title: On the Benefits of Large Learning Rates for Kernel Methods
- Authors: Gaspard Beugnot, Julien Mairal, Alessandro Rudi
- Abstract summary: We show that the benefit of large learning rates, first observed in deep learning, can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
- Score: 110.03020563291788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies an intriguing phenomenon related to the good
generalization performance of estimators obtained by using large learning rates
within gradient descent algorithms. Although this phenomenon was first observed
in the deep learning literature, we show that it can be precisely characterized
in the context of kernel methods, even though the resulting optimization problem
is convex. Specifically, we consider the minimization of a quadratic objective in
a separable Hilbert space, and show that with early stopping, the choice of
learning rate influences the spectral decomposition of the obtained solution on
the Hessian's eigenvectors. This extends an intuition described by Nakkiran
(2020) on a two-dimensional toy problem to realistic learning scenarios such as
kernel ridge regression. While large learning rates can be proven beneficial as
soon as there is a mismatch between the train and test objectives, we further
explain why this benefit already arises in classification tasks, without assuming
any particular mismatch between the train and test data distributions.
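The mechanism can be made concrete with a short numerical sketch. The code below (a minimal illustration, not the paper's experimental setup; the eigenvalues, learning rates, and stopping time are assumed for the example) runs gradient descent on a quadratic objective and reports, for each Hessian eigendirection, the fraction of the target component recovered after a fixed number of early-stopped iterations.

```python
import numpy as np

# Quadratic objective f(theta) = 0.5 * (theta - theta_star)^T H (theta - theta_star).
# Gradient descent with step size gamma satisfies, in the eigenbasis of H,
#   theta_t[i] - theta_star[i] = (1 - gamma * lam[i])**t * (theta_0[i] - theta_star[i]),
# so starting from theta_0 = 0, the fraction of theta_star recovered along
# eigendirection i after t steps is 1 - (1 - gamma * lam[i])**t.

lam = np.array([1.0, 0.3, 0.1, 0.03, 0.01])   # illustrative Hessian eigenvalues
t = 50                                        # early-stopping iteration (illustrative)

for gamma in (0.5, 1.9 / lam.max()):          # small vs. near-critical learning rate
    spectral_filter = 1.0 - (1.0 - gamma * lam) ** t
    print(f"gamma = {gamma:4.2f}, recovered fraction per eigendirection:",
          np.round(spectral_filter, 3))

# At the same stopping time, the larger learning rate fits the small-eigenvalue
# directions much more, so early stopping with different learning rates yields
# estimators with different spectral profiles on the Hessian's eigenvectors.
```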
Related papers
- A Historical Trajectory Assisted Optimization Method for Zeroth-Order Federated Learning [24.111048817721592]
Federated learning heavily relies on distributed gradient descent techniques.
In the situation where gradient information is not available, gradients need to be estimated from zeroth-order information.
We propose a non-isotropic sampling method to improve the gradient estimation procedure.
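For context, the sketch below shows the standard two-point zeroth-order gradient estimator with isotropic Gaussian directions, which is the baseline such methods improve upon; the non-isotropic, trajectory-assisted sampling of the paper is not reproduced, and the objective and parameters are made up for illustration.

```python
import numpy as np

def zeroth_order_grad(f, x, num_dirs=100, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Averages (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u over random isotropic
    Gaussian directions u; the cited paper replaces this isotropic sampling
    with a non-isotropic scheme informed by the past optimization trajectory.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape[0])
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / num_dirs

# Illustrative check on a quadratic, whose true gradient at x is 2*x.
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 0.5])
print(zeroth_order_grad(f, x, num_dirs=2000, rng=np.random.default_rng(0)))
```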
arXiv Detail & Related papers (2024-09-24T10:36:40Z)
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner, and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
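As a rough illustration of what descent on the dual of Gaussian process regression looks like, the sketch below takes gradient steps on random blocks of coordinates of the dual objective 0.5*a^T(K + s^2 I)a - a^T y; this is a plain randomized-block stand-in, not the paper's stochastic dual descent with its specific step sizes, momentum, and averaging, and the data and kernel are assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
n, sigma2 = 60, 0.25
X = rng.uniform(-3, 3, size=n)
y = np.sin(X) + 0.1 * rng.standard_normal(n)

# Dual objective for GP regression / kernel ridge: 0.5*a^T Kn a - a^T y with
# Kn = K + sigma2*I; its minimiser solves Kn a = y. We take gradient steps on
# random blocks of dual coordinates as a crude stochastic variant.
Kn = rbf_kernel(X, X) + sigma2 * np.eye(n)
alpha = np.zeros(n)
step = 1.0 / np.linalg.eigvalsh(Kn).max()
for _ in range(5000):
    idx = rng.choice(n, size=16, replace=False)      # random block of coordinates
    alpha[idx] -= step * (Kn[idx] @ alpha - y[idx])  # block of the dual gradient

exact = np.linalg.solve(Kn, y)
print("gap to the exact dual solution:", float(np.linalg.norm(alpha - exact)))
# The prediction at new inputs X_new is rbf_kernel(X_new, X) @ alpha.
```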
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Promises and Pitfalls of the Linearized Laplace in Bayesian Optimization [73.80101701431103]
The linearized-Laplace approximation (LLA) has been shown to be effective and efficient in constructing Bayesian neural networks.
We study the usefulness of the LLA in Bayesian optimization and highlight its strong performance and flexibility.
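To fix ideas, here is a minimal sketch of the linearized-Laplace predictive on a model that is linear in its parameters (random cosine features standing in for a network Jacobian), where the linearization is exact; the feature model, the use of an upper-confidence-bound acquisition, and all numbers are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, W, b):
    # Fixed random cosine features; their values play the role of the Jacobian
    # of a trained network at its MAP weights in the linearized-Laplace view.
    return np.cos(x[:, None] * W + b)

W = 2.0 * rng.normal(size=20)
b = rng.uniform(0.0, 2.0 * np.pi, size=20)
X = rng.uniform(-2, 2, size=30)
y = np.sin(3 * X) + 0.1 * rng.standard_normal(30)

Phi = features(X, W, b)                                       # (30, 20)
sigma2, prior_prec = 0.1 ** 2, 1.0
precision = Phi.T @ Phi / sigma2 + prior_prec * np.eye(20)    # GGN + prior precision
Sigma = np.linalg.inv(precision)                              # Laplace covariance
theta_map = Sigma @ (Phi.T @ y) / sigma2                      # MAP weights

# Predictive mean and variance at query points; in Bayesian optimization these
# feed an acquisition function such as an upper confidence bound.
Xq = np.linspace(-2, 2, 5)
Phi_q = features(Xq, W, b)
mean = Phi_q @ theta_map
var = np.einsum("ij,jk,ik->i", Phi_q, Sigma, Phi_q) + sigma2
print("UCB acquisition values:", np.round(mean + 2.0 * np.sqrt(var), 2))
```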
arXiv Detail & Related papers (2023-04-17T14:23:43Z)
- Scalable Bayesian Meta-Learning through Generalized Implicit Gradients [64.21628447579772]
The implicit Bayesian meta-learning (iBaML) method not only broadens the scope of learnable priors, but also quantifies the associated uncertainty.
Analytical error bounds are established to demonstrate the precision and efficiency of the generalized implicit gradient over the explicit one.
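The basic mechanism that implicit gradients build on can be sketched with the implicit function theorem on a toy bilevel problem; the quadratic inner task below is an assumption for illustration, and the paper's generalized implicit gradients and error bounds are not reproduced.

```python
import numpy as np

# Toy bilevel problem:
#   inner: theta*(phi) = argmin_theta 0.5*||theta - phi||^2 + 0.5*theta^T A theta
#   outer: F(phi)      = 0.5*||theta*(phi) - target||^2
# The implicit function theorem gives d theta*/d phi = H^{-1} with H = I + A the
# inner Hessian, so the hypergradient is H^{-1} (theta* - target): one linear
# solve, instead of differentiating through an unrolled inner optimization loop.
rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d))
A = A @ A.T / d                                        # SPD curvature of the inner task
target, phi = rng.normal(size=d), rng.normal(size=d)

H = np.eye(d) + A
theta_star = np.linalg.solve(H, phi)                   # exact inner solution
hypergrad = np.linalg.solve(H, theta_star - target)    # implicit gradient of F wrt phi

# Finite-difference check of the first coordinate of the hypergradient.
def F(p):
    return 0.5 * np.sum((np.linalg.solve(H, p) - target) ** 2)

eps = 1e-5
e0 = np.zeros(d)
e0[0] = 1.0
print(hypergrad[0], (F(phi + eps * e0) - F(phi - eps * e0)) / (2 * eps))
```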
arXiv Detail & Related papers (2023-03-31T02:10:30Z)
- Pairwise Learning via Stagewise Training in Proximal Setting [0.0]
We combine adaptive sample size and importance sampling techniques for pairwise learning, with convergence guarantees for nonsmooth convex pairwise loss functions.
We demonstrate that sampling opposite instances at each iteration reduces the variance of the gradient, hence accelerating convergence.
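The sampling scheme can be illustrated with a plain stochastic subgradient method on a pairwise hinge loss, drawing one instance from each class per step; the adaptive sample sizes, importance sampling, and proximal steps of the paper are not reproduced, and the synthetic data below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
labels = np.sign(X @ w_true + 0.3 * rng.standard_normal(n))

pos, neg = X[labels > 0], X[labels < 0]
w = np.zeros(d)
for t in range(1, 5001):
    xp = pos[rng.integers(len(pos))]        # one positive instance
    xn = neg[rng.integers(len(neg))]        # one "opposite" (negative) instance
    diff = xp - xn
    if w @ diff < 1.0:                      # pairwise hinge loss max(0, 1 - w.(x+ - x-))
        w += diff / np.sqrt(t)              # subgradient step with decaying step size

# Fraction of correctly ordered positive/negative pairs (empirical AUC).
auc = np.mean((pos @ w)[:, None] > (neg @ w)[None, :])
print("pairwise ranking accuracy:", round(float(auc), 3))
```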
arXiv Detail & Related papers (2022-08-08T11:51:01Z)
- Learning primal-dual sparse kernel machines [10.230121160034674]
Traditionally, kernel methods rely on the representer theorem, which states that the solution to a learning problem is obtained as a linear combination of the data mapped into the reproducing kernel Hilbert space (RKHS).
We propose to search for a solution in the RKHS that has a pre-image decomposition in the original data space, where the elements do not necessarily correspond to elements of the training set.
Our gradient-based method then hinges on optimisation over possibly sparse elements in the input space, and enables us to obtain a kernel-based model with both primal and dual sparsity.
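The core idea (expansion points that live in the input space and are learned rather than fixed to training points) can be sketched with plain gradient descent on an RBF model; the paper's primal-dual formulation and sparsity-inducing penalties are not reproduced, and the data, kernel lengthscale, and step sizes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, ell = 200, 8, 0.7                       # training size, number of pre-images, lengthscale
X = rng.uniform(-3, 3, size=n)
y = np.sin(2 * X) + 0.1 * rng.standard_normal(n)

def kernel(A, B):
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell ** 2)

Z = rng.uniform(-3, 3, size=m)                # expansion points, free to move in input space
beta = np.zeros(m)
lr = 0.05
for _ in range(5000):
    K = kernel(X, Z)                          # (n, m)
    r = K @ beta - y                          # residuals of f(x) = sum_j beta_j k(z_j, x)
    grad_beta = K.T @ r / n
    grad_Z = (beta / ell ** 2) * np.sum(r[:, None] * K * (X[:, None] - Z[None, :]), axis=0) / n
    beta -= lr * grad_beta                    # update expansion coefficients
    Z -= lr * grad_Z                          # move expansion points in the input space

rmse = np.sqrt(np.mean((kernel(X, Z) @ beta - y) ** 2))
print(f"training RMSE with {m} learned expansion points: {rmse:.3f}")
```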
arXiv Detail & Related papers (2021-08-27T09:38:53Z)
- On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We show an excess risk bound for the gradient descent solution of the least squares objective.
We find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
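A quick way to see the double-descent shape for least squares is the sketch below: gradient descent on the squared loss from a zero initialization converges to the minimum-norm solution, which is computed here directly with the pseudoinverse; the random-feature model, noise level, and sizes are assumptions for illustration, not the setting analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 100, 2000, 40
w_star = rng.normal(size=d) / np.sqrt(d)

def make_data(n_samples):
    X = rng.normal(size=(n_samples, d))
    return X, X @ w_star + 0.5 * rng.standard_normal(n_samples)

X_train, y_train = make_data(n)
X_test, y_test = make_data(n_test)

V = rng.normal(size=(d, 400)) / np.sqrt(d)          # random projection directions
for p in (20, 50, 90, 100, 110, 200, 400):          # number of random ReLU features
    feats = lambda X: np.maximum(X @ V[:, :p], 0.0)
    # Minimum-norm least squares fit (the limit of gradient descent started at zero).
    w = np.linalg.pinv(feats(X_train)) @ y_train
    test_mse = np.mean((feats(X_test) @ w - y_test) ** 2)
    print(f"p = {p:3d}   test MSE = {test_mse:.3f}")
# The test error typically peaks near the interpolation threshold p ~ n and then
# decreases again in the overparameterized regime.
```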
arXiv Detail & Related papers (2021-07-27T09:13:11Z)
- From inexact optimization to learning via gradient concentration [22.152317081922437]
In this paper, we investigate the phenomenon in the context of linear models with smooth loss functions.
We propose a proof technique combining ideas from inexact optimization and probability theory, specifically gradient concentration.
arXiv Detail & Related papers (2021-06-09T21:23:29Z)
- Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.