The Golden Ratio of Learning and Momentum
- URL: http://arxiv.org/abs/2006.04751v1
- Date: Mon, 8 Jun 2020 17:08:13 GMT
- Title: The Golden Ratio of Learning and Momentum
- Authors: Stefan Jaeger
- Abstract summary: This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse.
All results taken together show that loss, learning rate, and momentum are closely connected.
- Score: 0.5076419064097732
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient descent has been a central training principle for artificial neural
networks from the early beginnings to today's deep learning networks. The most
common implementation is the backpropagation algorithm for training
feed-forward neural networks in a supervised fashion. Backpropagation involves
computing the gradient of a loss function, with respect to the weights of the
network, to update the weights and thus minimize loss. Although the mean square
error is often used as a loss function, the general stochastic gradient descent
principle does not immediately connect with a specific loss function. Another
drawback of backpropagation has been the search for optimal values of two
important training parameters, learning rate and momentum weight, which are
determined empirically in most systems. The learning rate specifies the step
size towards a minimum of the loss function when following the gradient, while
the momentum weight considers previous weight changes when updating current
weights. Using both parameters in conjunction is generally accepted as a
means of improving training, although their specific values do
not follow immediately from standard backpropagation theory. This paper
proposes a new information-theoretical loss function motivated by neural signal
processing in a synapse. The new loss function implies a specific learning rate
and momentum weight, leading to empirical parameters often used in practice.
The proposed framework also provides a more formal explanation of the momentum
term and its smoothing effect on the training process. All results taken
together show that loss, learning rate, and momentum are closely connected. To
support these theoretical findings, experiments for handwritten digit
recognition show the practical usefulness of the proposed loss function and
training parameters.
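For readers unfamiliar with the two training parameters discussed in the abstract, the following is a minimal sketch of the standard gradient-descent-with-momentum update it refers to. The learning rate and momentum values below are common empirical defaults assumed for illustration only; they are not the specific values derived from the paper's information-theoretical loss function.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One gradient-descent update with a momentum term.

    The velocity accumulates an exponentially weighted sum of past
    gradients, which is the smoothing effect the abstract attributes
    to the momentum weight. The defaults lr=0.01 and momentum=0.9 are
    common empirical choices, not the values derived in the paper.
    """
    velocity = momentum * velocity - lr * grad  # fold in previous weight changes
    w = w + velocity                            # step towards a minimum of the loss
    return w, velocity

# Toy usage on the quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v)
print(w)  # converges towards the minimizer at the origin
```

The paper's contribution, per the abstract, is to derive specific values for these two hyperparameters from its proposed loss function rather than tuning them empirically as in the sketch above.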
Related papers
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function, that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- Learning fixed points of recurrent neural networks by reparameterizing the network model [0.0]
In computational neuroscience, fixed points of recurrent neural networks are commonly used to model neural responses to static or slowly changing stimuli.
A natural approach is to use gradient descent on the Euclidean space of synaptic weights.
We show that this approach can lead to poor learning performance due to singularities that arise in the loss surface.
arXiv Detail & Related papers (2023-07-13T13:09:11Z)
- Weight Compander: A Simple Weight Reparameterization for Regularization [5.744133015573047]
We introduce weight compander, a novel effective method to improve generalization of deep neural networks.
We show experimentally that using weight compander in addition to standard regularization methods improves the performance of neural networks.
arXiv Detail & Related papers (2023-06-29T14:52:04Z)
- Alternate Loss Functions for Classification and Robust Regression Can Improve the Accuracy of Artificial Neural Networks [6.452225158891343]
This paper shows that training speed and final accuracy of neural networks can significantly depend on the loss function used to train neural networks.
Two new classification loss functions that significantly improve performance on a wide variety of benchmark tasks are proposed.
arXiv Detail & Related papers (2023-03-17T12:52:06Z)
- Online Loss Function Learning [13.744076477599707]
Loss function learning aims to automate the task of designing a loss function for a machine learning model.
We propose a new loss function learning technique for adaptively updating the loss function online after each update to the base model parameters.
arXiv Detail & Related papers (2023-01-30T19:22:46Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Understanding Square Loss in Training Overparametrized Neural Network Classifiers [31.319145959402462]
We contribute to the theoretical understanding of square loss in classification by systematically investigating how it performs for overparametrized neural networks.
We consider two cases, according to whether the classes are separable or not. In the general non-separable case, a fast convergence rate is established for both the misclassification rate and the calibration error.
The resulting margin is proven to be lower bounded away from zero, providing theoretical guarantees for robustness.
arXiv Detail & Related papers (2021-12-07T12:12:30Z)
- MTAdam: Automatic Balancing of Multiple Training Loss Terms [95.99508450208813]
We generalize the Adam optimization algorithm to handle multiple loss terms.
We show that training with the new method leads to fast recovery from suboptimal initial loss weighting.
arXiv Detail & Related papers (2020-06-25T20:27:27Z)
- Feature Purification: How Adversarial Training Performs Robust Deep Learning [66.05472746340142]
We present a principle we call Feature Purification: one cause of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network.
We present both experiments on the CIFAR-10 dataset to illustrate this principle and a theoretical result proving that, for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.
arXiv Detail & Related papers (2020-05-20T16:56:08Z)
- Revisiting Initialization of Neural Networks [72.24615341588846]
We propose a rigorous estimation of the global curvature of weights across layers by approximating and controlling the norm of their Hessian matrix.
Our experiments on Word2Vec and the MNIST/CIFAR image classification tasks confirm that tracking the Hessian norm is a useful diagnostic tool.
arXiv Detail & Related papers (2020-04-20T18:12:56Z)
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including this list) and is not responsible for any consequences of its use.