On Learning Rates and Schrödinger Operators
- URL: http://arxiv.org/abs/2004.06977v1
- Date: Wed, 15 Apr 2020 09:52:37 GMT
- Title: On Learning Rates and Schrödinger Operators
- Authors: Bin Shi, Weijie J. Su, Michael I. Jordan
- Abstract summary: We present a general theoretical analysis of the effect of the learning rate.
We find that the linear convergence rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions.
- Score: 105.32118775014015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The learning rate is perhaps the single most important parameter in the
training of neural networks and, more broadly, in stochastic (nonconvex)
optimization. Accordingly, there are numerous effective, but poorly understood,
techniques for tuning the learning rate, including learning rate decay, which
starts with a large initial learning rate that is gradually decreased. In this
paper, we present a general theoretical analysis of the effect of the learning
rate in stochastic gradient descent (SGD). Our analysis is based on the use of
a learning-rate-dependent stochastic differential equation (lr-dependent SDE)
that serves as a surrogate for SGD. For a broad class of objective functions,
we establish a linear rate of convergence for this continuous-time formulation
of SGD, highlighting the fundamental importance of the learning rate in SGD,
and contrasting it with gradient descent and stochastic gradient Langevin dynamics.
Moreover, we obtain an explicit expression for the optimal linear rate by
analyzing the spectrum of the Witten-Laplacian, a special case of the
Schr\"odinger operator associated with the lr-dependent SDE. Strikingly, this
expression clearly reveals the dependence of the linear convergence rate on the
learning rate -- the linear rate decreases rapidly to zero as the learning rate
tends to zero for a broad class of nonconvex functions, whereas it stays
constant for strongly convex functions. Based on this sharp distinction between
nonconvex and convex problems, we provide a mathematical interpretation of the
benefits of using learning rate decay for nonconvex optimization.
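To make the continuous-time surrogate concrete, here is a minimal numerical sketch, assuming the lr-dependent SDE takes the standard form dX_t = -∇f(X_t) dt + √s dW_t with learning rate s; the abstract does not spell out the exact construction, so this form, the double-well objective, and the additive noise model below are illustrative assumptions rather than the paper's definitions. The sketch runs an Euler-Maruyama discretization of the SDE next to plain SGD over a matched time horizon, so one can observe how shrinking s suppresses the stochastic exploration on a nonconvex landscape.

```python
# Illustrative comparison of SGD with an assumed lr-dependent SDE surrogate
#   dX_t = -f'(X_t) dt + sqrt(s) dW_t
# on a 1-D double-well objective. All modelling choices here are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # Gradient of the double-well f(x) = (x**2 - 1)**2 / 4, minima at x = -1 and x = +1.
    return x * (x**2 - 1)

def sgd(x0, s, n_steps, noise_std=1.0):
    """Plain SGD with additive gradient noise: x_{k+1} = x_k - s * (f'(x_k) + xi_k)."""
    x = x0
    for _ in range(n_steps):
        x -= s * (grad(x) + noise_std * rng.standard_normal())
    return x

def lr_sde(x0, s, t_end, dt=1e-3):
    """Euler-Maruyama discretization of the assumed surrogate dX = -f'(X) dt + sqrt(s) dW."""
    x = x0
    for _ in range(int(t_end / dt)):
        x += -grad(x) * dt + np.sqrt(s * dt) * rng.standard_normal()
    return x

# Match k SGD steps of size s to SDE time t = k * s so the two trajectories are comparable.
for s in (0.5, 0.1, 0.01):
    t_end = 10.0
    x_sgd = sgd(x0=-1.0, s=s, n_steps=int(t_end / s))
    x_sde = lr_sde(x0=-1.0, s=s, t_end=t_end)
    print(f"s = {s:5.2f}   SGD endpoint: {x_sgd:+.2f}   SDE endpoint: {x_sde:+.2f}")
```

Matching k SGD steps of size s to SDE time t = k·s reflects the usual surrogate scaling: the diffusion term is proportional to √s, so as s → 0 the process degenerates toward deterministic gradient flow, which is consistent with the abstract's claim that the linear rate vanishes with the learning rate on nonconvex problems while remaining constant in the strongly convex case.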
Related papers
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}(\ln(T) / T^{1 - \frac{1}{\alpha}})$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Non-Parametric Learning of Stochastic Differential Equations with Non-asymptotic Fast Rates of Convergence [65.63201894457404]
We propose a novel non-parametric learning paradigm for the identification of drift and diffusion coefficients of non-linear stochastic differential equations.
The key idea essentially consists of fitting an RKHS-based approximation of the corresponding Fokker-Planck equation to the observed data.
arXiv Detail & Related papers (2023-05-24T20:43:47Z) - On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that the benefits of large learning rates can be precisely characterized in the context of kernel methods.
We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z) - On the Hyperparameters in Stochastic Gradient Descent with Momentum [6.396288020763144]
We present a theoretical analysis of stochastic gradient descent with momentum (SGDM) in this paper.
We show how the optimal linear rate of convergence for the continuous-time surrogate of SGDM varies as the momentum parameter increases from zero.
Finally, we show that the surrogate momentum dynamics has no essential difference from standard momentum.
arXiv Detail & Related papers (2021-08-09T11:25:03Z) - Benefit of deep learning with non-convex noisy gradient descent:
Provable excess risk bound and superiority to kernel methods [41.60125423028092]
We show that any linear estimator can be outperformed by deep learning in the sense of the minimax optimal rate.
The excess risk bounds attain the so-called fast learning rate, which is faster than the standard $O(1/\sqrt{n})$ bounds.
arXiv Detail & Related papers (2020-12-06T09:22:16Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Reconciling Modern Deep Learning with Traditional Optimization Analyses:
The Intrinsic Learning Rate [36.83448475700536]
Recent works suggest that the use of Batch Normalization in today's deep learning can move it far from a traditional optimization viewpoint.
This paper highlights other ways in which the behavior of normalized nets departs from traditional viewpoints, formulating one of them as the Fast Equilibrium Conjecture, which we suggest holds the key to why Batch Normalization is effective.
arXiv Detail & Related papers (2020-10-06T17:58:29Z) - Optimal Rates for Averaged Stochastic Gradient Descent under Neural
Tangent Kernel Regime [50.510421854168065]
We show that averaged stochastic gradient descent can achieve the minimax optimal convergence rate.
We show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate.
arXiv Detail & Related papers (2020-06-22T14:31:37Z) - Sparse Perturbations for Improved Convergence in Stochastic Zeroth-Order
Optimization [10.907491258280608]
Interest in stochastic zeroth-order (SZO) methods has recently been revived in black-box optimization scenarios such as adversarial black-box attacks on deep neural networks.
SZO methods only require the ability to evaluate the objective function at random input points.
We present an SZO optimization method that reduces the dependency on the dimensionality of the random perturbations to be evaluated; a generic sketch of this idea appears after this list.
arXiv Detail & Related papers (2020-06-02T16:39:37Z)
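As referenced in the sparse-perturbation entry above, here is a minimal, generic sketch of a sparse stochastic zeroth-order gradient estimator: each query perturbs only k randomly chosen coordinates, so the number of perturbed dimensions per evaluation no longer scales with the full dimension d. The function sparse_szo_grad, its parameters, and the toy quadratic below are hypothetical illustrations of the general idea, not the cited paper's exact algorithm.

```python
# Hypothetical sketch of a sparse stochastic zeroth-order (SZO) gradient estimate:
# perturb only k randomly chosen coordinates per query instead of all d of them.
# Generic illustration of the idea, not the cited paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

def sparse_szo_grad(f, x, k=2, mu=1e-2, n_samples=8):
    """Estimate the gradient of f at x from function values only, using sparse perturbations.

    k          -- number of coordinates perturbed per sample (k << len(x))
    mu         -- finite-difference / smoothing step
    n_samples  -- number of random perturbations averaged
    """
    d = x.size
    g = np.zeros(d)
    for _ in range(n_samples):
        idx = rng.choice(d, size=k, replace=False)    # random sparse support
        u = np.zeros(d)
        u[idx] = rng.standard_normal(k)
        g += (f(x + mu * u) - f(x)) / mu * u          # forward-difference estimator
    return g / n_samples

# Tiny usage example on a quadratic: only evaluations of f are used, never its gradient.
f = lambda x: 0.5 * np.sum(x**2)
x = np.ones(10)
for _ in range(200):
    x -= 0.1 * sparse_szo_grad(f, x)
print("final objective:", f(x))
```

With k nonzero standard-normal entries on a uniformly random support, the estimator's expectation is approximately (k/d)·∇f(x), so it can be plugged directly into a plain gradient step as in the toy loop above.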