Maximal Initial Learning Rates in Deep ReLU Networks
- URL: http://arxiv.org/abs/2212.07295v2
- Date: Fri, 26 May 2023 01:39:33 GMT
- Title: Maximal Initial Learning Rates in Deep ReLU Networks
- Authors: Gaurav Iyer, Boris Hanin, David Rolnick
- Abstract summary: We introduce the maximal initial learning rate $\eta^{\ast}$.
We observe that in constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently from the maximum learning rate later in training.
- Score: 32.157430904535126
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training a neural network requires choosing a suitable learning rate, which
involves a trade-off between speed and effectiveness of convergence. While
there has been considerable theoretical and empirical analysis of how large the
learning rate can be, most prior work focuses only on late-stage training. In
this work, we introduce the maximal initial learning rate $\eta^{\ast}$ - the
largest learning rate at which a randomly initialized neural network can
successfully begin training and achieve (at least) a given threshold accuracy.
Using a simple approach to estimate $\eta^{\ast}$, we observe that in
constant-width fully-connected ReLU networks, $\eta^{\ast}$ behaves differently
from the maximum learning rate later in training. Specifically, we find that
$\eta^{\ast}$ is well predicted as a power of depth $\times$ width, provided
that (i) the width of the network is sufficiently large compared to the depth,
and (ii) the input layer is trained at a relatively small learning rate. We
further analyze the relationship between $\eta^{\ast}$ and the sharpness
$\lambda_{1}$ of the network at initialization, indicating they are closely
though not inversely related. We formally prove bounds for $\lambda_{1}$ in
terms of depth $\times$ width that align with our empirical results.
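Below is a minimal sketch of how $\eta^{\ast}$ could be estimated in practice, assuming a geometric bisection over learning rates for a constant-width fully-connected ReLU network, together with a rough power-iteration estimate of the sharpness $\lambda_{1}$ at initialization. This is not the authors' code or exact protocol; the architecture, toy data, step budget, threshold accuracy, and search bracket are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn


def make_relu_mlp(width=256, depth=8, in_dim=784, n_classes=10):
    # Constant-width fully-connected ReLU network, as in the paper's setting.
    layers = [nn.Linear(in_dim, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers.append(nn.Linear(width, n_classes))
    return nn.Sequential(*layers)


def trains_successfully(lr, data, targets, threshold=0.2, steps=500, seed=0):
    # True if a freshly initialized network reaches the threshold accuracy
    # within a fixed step budget when trained at learning rate `lr`.
    torch.manual_seed(seed)
    model = make_relu_mlp()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(data), targets)
        if not torch.isfinite(loss):
            return False  # training diverged at this learning rate
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (model(data).argmax(dim=1) == targets).float().mean().item()
    return acc >= threshold


def estimate_eta_star(data, targets, lo=1e-4, hi=10.0, iters=12):
    # Bisection on the log of the learning rate, assuming training succeeds
    # below eta* and fails above it (the premise behind eta*).
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if trains_successfully(mid, data, targets):
            lo = mid
        else:
            hi = mid
    return lo


def sharpness_at_init(data, targets, iters=20, seed=0):
    # Rough power-iteration estimate of the sharpness lambda_1: the top
    # eigenvalue of the loss Hessian at initialization, via Hessian-vector
    # products (double backward).
    torch.manual_seed(seed)
    model = make_relu_mlp()
    loss = nn.CrossEntropyLoss()(model(data), targets)
    params = list(model.parameters())
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    norm = None
    for _ in range(iters):
        gv = sum((g * u).sum() for g, u in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)  # H @ v
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    return norm.item()  # approximates |lambda_1|


# Toy usage with random data standing in for a real training set.
x = torch.randn(1024, 784)
y = torch.randint(0, 10, (1024,))
print("estimated eta* ~", estimate_eta_star(x, y))
print("sharpness lambda_1 at init ~", sharpness_at_init(x, y))
```

In practice a real dataset and the paper's own budget and threshold would replace the toy data, and the paper's condition (ii) (training the input layer at a relatively small learning rate) could be reflected via per-parameter-group learning rates in the optimizer.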
Related papers
- The Optimization Landscape of SGD Across the Feature Learning Strength [102.1353410293931]
We study the effect of scaling the feature learning strength $\gamma$ across a variety of models and datasets in the online training setting.
We find that optimal online performance is often found at large $\gamma$.
Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
arXiv Detail & Related papers (2024-10-06T22:30:14Z) - Bayesian Inference with Deep Weakly Nonlinear Networks [57.95116787699412]
We show at a physics level of rigor that Bayesian inference with a fully connected neural network is solvable.
We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature.
arXiv Detail & Related papers (2024-05-26T17:08:04Z) - Rates of Approximation by ReLU Shallow Neural Networks [8.22379888383833]
We show that ReLU shallow neural networks with $m$ hidden neurons can uniformly approximate functions from the Hölder space.
Such rates are very close to the optimal one $O(m^{-\frac{r}{d}})$ in the sense that $\frac{d+2}{d+4}$ is close to $1$, when the dimension $d$ is large.
arXiv Detail & Related papers (2023-07-24T00:16:50Z) - Wide neural networks: From non-gaussian random fields at initialization
to the NTK geometry of training [0.0]
Recent developments in applications of artificial neural networks with over $n = 10^{14}$ parameters make it extremely important to study the large-$n$ behaviour of such networks.
Most works studying wide neural networks have focused on the infinite-width $n \to +\infty$ limit of such networks.
In this work we will study their behavior for large, but finite $n$.
arXiv Detail & Related papers (2023-04-06T21:34:13Z) - Understanding Deep Neural Function Approximation in Reinforcement
Learning via $\epsilon$-Greedy Exploration [53.90873926758026]
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL)
We focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed by Besov (and Barron) function spaces.
Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it to a generalization problem under the non-iid setting.
arXiv Detail & Related papers (2022-09-15T15:42:47Z) - Neural Capacitance: A New Perspective of Neural Network Selection via
Edge Dynamics [85.31710759801705]
Current practice incurs expensive computational costs in model training for performance prediction.
We propose a novel framework for neural network selection by analyzing the governing dynamics over synaptic connections (edges) during training.
Our framework is built on the fact that back-propagation during neural network training is equivalent to the dynamical evolution of synaptic connections.
arXiv Detail & Related papers (2022-01-11T20:53:15Z) - Does Preprocessing Help Training Over-parameterized Neural Networks? [19.64638346701198]
We propose two novel preprocessing ideas to bypass the $\Omega(mnd)$ barrier.
Our results provide theoretical insights for a large number of previously established fast training methods.
arXiv Detail & Related papers (2021-10-09T18:16:23Z) - Towards Deep Learning Models Resistant to Large Perturbations [0.0]
Adversarial robustness has proven to be a required property of machine learning algorithms.
We show that the well-established algorithm called "adversarial training" fails to train a deep neural network given a large, but reasonable, perturbation magnitude.
arXiv Detail & Related papers (2020-03-30T12:03:09Z) - Taylorized Training: Towards Better Approximation of Neural Network
Training at Finite Width [116.69845849754186]
Taylorized training involves training the $k$-th order Taylor expansion of the neural network.
We show that Taylorized training agrees with full neural network training increasingly better as we increase $k$.
We complement our experiments with theoretical results showing that the approximation error of $k$-th order Taylorized models decays exponentially in $k$ for wide neural networks (a minimal first-order sketch follows this list).
arXiv Detail & Related papers (2020-02-10T18:37:04Z) - Backward Feature Correction: How Deep Learning Performs Deep
(Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
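The Taylorized-training entry above trains the $k$-th order Taylor expansion of the network in its parameters around initialization. As a rough illustration of the $k = 1$ (linearized) case only, here is a sketch using `torch.func`; it is an assumption-laden toy, not that paper's implementation, and higher orders $k > 1$ would need genuinely higher-order expansions rather than a single Jacobian-vector product.

```python
import torch
import torch.nn as nn
from torch.func import functional_call, grad, jvp

# Base network whose first-order Taylor expansion (in its parameters) we train.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
theta0 = {k: v.detach().clone() for k, v in net.named_parameters()}  # expansion point


def forward(params, x):
    return functional_call(net, params, (x,))


def taylorized_loss(delta, x, y):
    # f_lin(theta0 + delta; x) = f(theta0; x) + J_f(theta0; x) @ delta
    out, jvp_out = jvp(lambda p: forward(p, x), (theta0,), (delta,))
    pred = out + jvp_out
    return ((pred - y) ** 2).mean()


grad_fn = grad(taylorized_loss)  # gradient w.r.t. the displacement `delta`

x, y = torch.randn(128, 10), torch.randn(128, 1)
delta = {k: torch.zeros_like(v) for k, v in theta0.items()}
lr = 1e-2
for _ in range(200):
    g = grad_fn(delta, x, y)
    delta = {k: delta[k] - lr * g[k] for k in delta}  # plain SGD on delta
```

For $k = 1$ this is the familiar linearized (NTK-style) model; the architecture, data, learning rate, and step count are placeholders.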