A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks
- URL: http://arxiv.org/abs/2305.12125v1
- Date: Sat, 20 May 2023 07:18:06 GMT
- Title: A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks
- Authors: Arunselvan Ramaswamy, Shalabh Bhatnagar, Naman Saxena
- Abstract summary: We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard stochastic gradient descent with gradient clipping.
We show, in theory and through experiments, that our algorithm's updates have low variance and that the training loss decreases smoothly.
- Score: 4.21061712600981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel algorithm for training deep neural networks in supervised
(classification and regression) and unsupervised (reinforcement learning)
scenarios. This algorithm combines the standard stochastic gradient descent and
the gradient clipping method. The output layer is updated using clipped
gradients, while the rest of the neural network is updated using standard gradients.
Updating the output layer with clipped gradients stabilizes it. We show that
the remaining layers are automatically stabilized provided the neural network
is only composed of squashing (compact range) activations. We also present a
novel squashing activation function - it is obtained by modifying a Gaussian
Error Linear Unit (GELU) to have compact range - we call it Truncated GELU
(tGELU). Unlike other squashing activations, such as sigmoid, the range of
tGELU can be explicitly specified. As a consequence, the problem of vanishing
gradients that arises due to a small range, e.g., in the case of a sigmoid
activation, is eliminated. We prove that a neural network composed of squashing activations
(tGELU, sigmoid, etc.), when updated using the algorithm presented herein, is
numerically stable and has consistent performance (low variance). The theory is
supported by extensive experiments. Within reinforcement learning, as a
consequence of our study, we show that target networks in Deep Q-Learning can
be omitted, greatly speeding up learning and reducing memory requirements.
Cross-entropy based classification algorithms that suffer from high variance
issues are more consistent when trained using our framework. One symptom of
numerical instability in training is the high variance of the neural network
update values. We show, in theory and through experiments, that our algorithm's
updates have low variance and that the training loss decreases smoothly.
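As an illustration of the training scheme described above, here is a minimal sketch, not the authors' reference implementation: hidden layers built from a squashing activation are updated with standard stochastic gradients, while only the output layer's gradients are norm-clipped before the update. The tGELU form below (GELU applied to a pre-activation clamped to a user-chosen interval [-r, r]) is an assumption, as are the network sizes, clipping threshold, and learning rate; the paper specifies the exact construction.

```python
# Minimal sketch (assumed details; see the paper for the exact construction):
# hidden layers use a squashing activation with standard SGD updates; only the
# output layer's gradients are clipped before the optimizer step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TGELU(nn.Module):
    """Assumed form of truncated GELU: GELU of a pre-activation clamped to [-r, r]."""

    def __init__(self, r: float = 3.0):
        super().__init__()
        self.r = r  # half-width of the compact range; chosen by the user

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamping keeps the output inside the compact interval [gelu(-r), gelu(r)].
        return F.gelu(torch.clamp(x, -self.r, self.r))


class Net(nn.Module):
    def __init__(self, d_in: int, d_hidden: int, d_out: int):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(d_in, d_hidden), TGELU(),
            nn.Linear(d_hidden, d_hidden), TGELU(),
        )
        self.output = nn.Linear(d_hidden, d_out)  # only this layer gets clipped gradients

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output(self.hidden(x))


def train_step(model: Net, opt: torch.optim.Optimizer, x, y, clip_norm: float = 1.0):
    opt.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    # Clip the output layer's gradient norm; hidden layers keep their raw gradients.
    torch.nn.utils.clip_grad_norm_(model.output.parameters(), clip_norm)
    opt.step()
    return loss.item()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = Net(8, 32, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    x, y = torch.randn(256, 8), torch.randn(256, 1)
    for step in range(5):
        print(step, train_step(model, opt, x, y))
```

In the Deep Q-Learning experiments referred to in the abstract, the same split update is applied to the online network, which is what allows the separate target network to be dropped.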
Related papers
- Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such a sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign.
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z) - Implicit Bias of Gradient Descent for Two-layer ReLU and Leaky ReLU
Networks on Nearly-orthogonal Data [66.1211659120882]
The implicit bias towards solutions with favorable properties is believed to be a key reason why neural networks trained by gradient-based optimization can generalize well.
While the implicit bias of gradient flow has been widely studied for homogeneous neural networks (including ReLU and leaky ReLU networks), the implicit bias of gradient descent is currently only understood for smooth neural networks.
arXiv Detail & Related papers (2023-10-29T08:47:48Z) - Benign Overfitting for Two-layer ReLU Convolutional Neural Networks [60.19739010031304]
We establish algorithm-dependent risk bounds for learning two-layer ReLU convolutional neural networks with label-flipping noise.
We show that, under mild conditions, the neural network trained by gradient descent can achieve near-zero training loss and Bayes optimal test risk.
arXiv Detail & Related papers (2023-03-07T18:59:38Z) - Globally Optimal Training of Neural Networks with Threshold Activation
Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights.
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - An Experimental Comparison Between Temporal Difference and Residual
Gradient with Neural Network Approximation [8.166265682999482]
In deep Q-learning with neural network approximation, gradient descent is barely used to solve the Bellman residual minimization problem.
In this work, we perform extensive experiments to show that Temporal Difference (TD) learning outperforms Residual Gradient (RG).
We also empirically find that the missing term in TD is a key reason why RG performs badly.
arXiv Detail & Related papers (2022-05-25T13:37:52Z) - Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear
Classification [3.158346511479111]
We propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions.
We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum.
arXiv Detail & Related papers (2020-11-23T07:50:09Z) - Superpolynomial Lower Bounds for Learning One-Layer Neural Networks
using Gradient Descent [25.589302381660453]
We show that any model trained using gradient descent with respect to the square loss will fail to achieve small test error in polynomial time.
For classification, we give a stronger result, namely that any statistical query (SQ) algorithm will fail to achieve small test error in polynomial time.
arXiv Detail & Related papers (2020-06-22T05:15:06Z)