Universal scaling laws in the gradient descent training of neural
networks
- URL: http://arxiv.org/abs/2105.00507v1
- Date: Sun, 2 May 2021 16:46:38 GMT
- Title: Universal scaling laws in the gradient descent training of neural
networks
- Authors: Maksim Velikanov and Dmitry Yarotsky
- Abstract summary: We show that the learning trajectory can be characterized by an explicit asymptotic at large training times.
Our results are based on spectral analysis of the evolution of a large network trained on the expected loss.
- Score: 10.508187462682308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current theoretical results on optimization trajectories of neural networks
trained by gradient descent typically have the form of rigorous but potentially
loose bounds on the loss values. In the present work we take a different
approach and show that the learning trajectory can be characterized by an
explicit asymptotic at large training times. Specifically, the leading term in
the asymptotic expansion of the loss behaves as a power law $L(t) \sim
t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the
smoothness of the activation function, and the class of function being
approximated. Our results are based on spectral analysis of the integral
operator representing the linearized evolution of a large network trained on
the expected loss. Importantly, the techniques we employ do not require a
specific form of the data distribution, for example Gaussian, thus making our
findings sufficiently universal.
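As a quick illustration of the scaling law above (not taken from the paper itself), the minimal Python sketch below estimates the exponent $\xi$ from a logged loss trajectory by fitting a line in log-log coordinates; the synthetic loss curve, the noise model, and the value xi_true = 0.5 are purely hypothetical.
```python
# Minimal sketch (assumptions, not the paper's method): recover the exponent xi
# in L(t) ~ C * t**(-xi) from a recorded loss trajectory at large training times.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(100, 10_000)          # keep only "large" training times
xi_true = 0.5                        # hypothetical exponent for the synthetic curve
loss = 3.0 * t**(-xi_true) * np.exp(0.01 * rng.standard_normal(t.size))

# In log-log coordinates the leading term is linear:
#   log L(t) = log C - xi * log t,
# so a least-squares fit of log L against log t gives -xi as the slope.
slope, intercept = np.polyfit(np.log(t), np.log(loss), deg=1)
xi_hat = -slope
print(f"estimated xi = {xi_hat:.3f} (true value {xi_true})")
```
In practice one would restrict the fit to the large-$t$ tail of the recorded losses, since the power law is only the leading term of the asymptotic expansion.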
Related papers
- A Random Matrix Theory Perspective on the Spectrum of Learned Features and Asymptotic Generalization Capabilities [30.737171081270322]
We study how fully-connected two-layer neural networks adapt to the target function after a single, but aggressive, gradient descent step.
This provides a sharp description of the impact of feature learning in the generalization of two-layer neural networks, beyond the random features and lazy training regimes.
arXiv Detail & Related papers (2024-10-24T17:24:34Z) - Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers [39.71511919246829]
Deep linear networks have been extensively studied, but little is known in the case of finite-width architectures with multiple outputs and convolutional layers.
Our work provides a dictionary that translates this physics intuition and terminology into rigorous Bayesian statistics.
arXiv Detail & Related papers (2024-06-05T13:37:42Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Fast Convergence in Learning Two-Layer Neural Networks with Separable
Data [37.908159361149835]
We study normalized gradient descent on two-layer neural nets.
We prove, for exponentially-tailed losses, that normalized GD leads to a linear rate of convergence of the training loss to the global optimum.
arXiv Detail & Related papers (2023-05-22T20:30:10Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been effectively demonstrated in solving forward and inverse differential equation problems.
PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ an implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - The Implicit Bias of Gradient Descent on Generalized Gated Linear
Networks [3.3946853660795893]
We derive the infinite-time training limit of a mathematically tractable class of deep nonlinear neural networks, gated linear networks (GLNs).
We show how architectural constraints and the implicit bias of gradient descent affect performance.
By making the inductive bias explicit, our framework is poised to inform the development of more efficient, biologically plausible, and robust learning algorithms.
arXiv Detail & Related papers (2022-02-05T22:37:39Z) - Subquadratic Overparameterization for Shallow Neural Networks [60.721751363271146]
We provide an analytical framework that allows us to adopt standard neural training strategies.
We achieve the desiderata via Polyak-Lojasiewicz, smoothness, and standard assumptions.
arXiv Detail & Related papers (2021-11-02T20:24:01Z) - Fractal Structure and Generalization Properties of Stochastic
Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded via the 'complexity' of the fractal structure that underlies its generalization measure.
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - How Neural Networks Extrapolate: From Feedforward to Graph Neural
Networks [80.55378250013496]
We study how neural networks trained by gradient descent extrapolate what they learn outside the support of the training distribution.
Graph Neural Networks (GNNs) have shown some success in more complex tasks.
arXiv Detail & Related papers (2020-09-24T17:48:59Z) - Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks
Trained with the Logistic Loss [0.0]
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks.
We analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations.
arXiv Detail & Related papers (2020-02-11T15:42:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.