Plateau Phenomenon in Gradient Descent Training of ReLU networks:
Explanation, Quantification and Avoidance
- URL: http://arxiv.org/abs/2007.07213v1
- Date: Tue, 14 Jul 2020 17:33:26 GMT
- Title: Plateau Phenomenon in Gradient Descent Training of ReLU networks:
Explanation, Quantification and Avoidance
- Authors: Mark Ainsworth and Yeonjong Shin
- Abstract summary: In general, neural networks are trained by gradient type optimization methods.
The loss function decreases rapidly at the beginning of training but then, after a relatively small number of steps, slows down significantly.
The present work aims to identify and quantify the root causes of the plateau phenomenon.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The ability of neural networks to provide `best in class' approximation
across a wide range of applications is well-documented. Nevertheless, the
powerful expressivity of neural networks comes to naught if one is unable to
effectively train (choose) the parameters defining the network. In general,
neural networks are trained by gradient descent type optimization methods, or a
stochastic variant thereof. In practice, such methods cause the loss function to
decrease rapidly at the beginning of training, but then, after a relatively small
number of steps, the decrease slows down significantly. The loss may even
appear to stagnate over the period of a large number of epochs, only to then
suddenly start to decrease fast again for no apparent reason. This so-called
plateau phenomenon manifests itself in many learning tasks.
The present work aims to identify and quantify the root causes of the plateau
phenomenon. No assumptions are made on the number of neurons relative to the
number of training data, and our results hold for both the lazy and adaptive
regimes. The main findings are: plateaux correspond to periods during which
activation patterns remain constant, where activation pattern refers to the
number of data points that activate a given neuron; quantification of
convergence of the gradient flow dynamics; and, characterization of stationary
points in terms of solutions of local least squares regression lines over subsets
of the training data. Based on these conclusions, we propose a new iterative
training method, the Active Neuron Least Squares (ANLS), characterised by the
explicit adjustment of the activation pattern at each step, which is designed
to enable a quick exit from a plateau. Illustrative numerical examples are
included throughout.
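The central quantity in the abstract is the activation pattern: the number of training points that activate a given hidden neuron, which remains constant during a plateau. Below is a minimal NumPy sketch (not the authors' code, and not the proposed ANLS method) that trains a one-hidden-layer ReLU network by full-batch gradient descent and flags the steps at which the activation pattern changes; the toy data, network width, and learning rate are illustrative assumptions.

```python
# Minimal sketch: track the per-neuron activation counts ("activation pattern")
# during gradient descent training of a one-hidden-layer ReLU network.
# During a plateau, no pattern-change messages are printed.
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 64, 1, 16                       # training points, input dim, hidden neurons (assumptions)
X = rng.uniform(-1.0, 1.0, size=(n, d))   # toy 1-D regression data (assumption)
y = np.sin(np.pi * X)                     # target values

W1 = rng.normal(size=(d, m)); b1 = np.zeros(m)
W2 = rng.normal(size=(m, 1))

def activation_counts(X, W1, b1):
    """Number of data points on which each hidden neuron is active (pre-activation > 0)."""
    return ((X @ W1 + b1) > 0).sum(axis=0)

lr = 1e-2
prev = activation_counts(X, W1, b1)
for step in range(5000):
    pre = X @ W1 + b1                     # pre-activations, shape (n, m)
    act = np.maximum(pre, 0.0)            # ReLU
    err = act @ W2 - y                    # residuals, shape (n, 1)
    loss = 0.5 * np.mean(err ** 2)

    # Gradients of the mean-squared loss (full batch).
    g_pred = err / n
    g_W2 = act.T @ g_pred
    g_pre = (g_pred @ W2.T) * (pre > 0)
    g_W1 = X.T @ g_pre
    g_b1 = g_pre.sum(axis=0)

    W1 -= lr * g_W1; b1 -= lr * g_b1; W2 -= lr * g_W2

    cur = activation_counts(X, W1, b1)
    if not np.array_equal(cur, prev):     # pattern changed; plateaus correspond to long stretches with no change
        print(f"step {step:5d}  loss {loss:.5f}  activation pattern changed")
    prev = cur
```

In this sketch the printed messages typically cluster at the start of training and then stop for long stretches while the loss stagnates, consistent with the paper's finding that plateaux coincide with periods of constant activation pattern; ANLS itself, which explicitly adjusts the pattern at each step, is not reconstructed here.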
Related papers
- Take A Shortcut Back: Mitigating the Gradient Vanishing for Training Spiking Neural Networks [15.691263438655842]
Spiking Neural Network (SNN) is a biologically inspired neural network infrastructure that has recently garnered significant attention.
Training an SNN directly poses a challenge due to the undefined gradient of the firing spike process.
We propose a shortcut back-propagation method in our paper, which advocates for transmitting the gradient directly from the loss to the shallow layers.
arXiv Detail & Related papers (2024-01-09T10:54:41Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Learning time-scales in two-layers neural networks [11.878594839685471]
We study the gradient flow dynamics of a wide two-layer neural network in high-dimension.
Based on new rigorous results, we propose a scenario for the learning dynamics in this setting.
arXiv Detail & Related papers (2023-02-28T19:52:26Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Early Stage Convergence and Global Convergence of Training Mildly
Parameterized Neural Networks [3.148524502470734]
We show that the loss decreases by a significant amount in the early stage of training, and that this decrease is fast.
We use a microscopic analysis of the activation patterns for the neurons, which helps us derive more powerful lower bounds for the gradient.
arXiv Detail & Related papers (2022-06-05T09:56:50Z) - Dynamic Neural Diversification: Path to Computationally Sustainable
Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z) - Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
arXiv Detail & Related papers (2020-11-18T18:52:08Z) - Over-parametrized neural networks as under-determined linear systems [31.69089186688224]
We show that it is unsurprising that simple neural networks can achieve zero training loss.
We show that kernels typically associated with the ReLU activation function have fundamental flaws.
We propose new activation functions that avoid the pitfalls of ReLU in that they admit zero training loss solutions for any set of distinct data points.
arXiv Detail & Related papers (2020-10-29T21:43:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.