Differentiable Self-Adaptive Learning Rate
- URL: http://arxiv.org/abs/2210.10290v1
- Date: Wed, 19 Oct 2022 04:28:57 GMT
- Title: Differentiable Self-Adaptive Learning Rate
- Authors: Bozhou Chen, Hongzhi Wang, Chenmin Ba
- Abstract summary: We propose a novel adaptation algorithm in which the learning rate is parameter-specific and internally structured.
It is shown that our algorithm achieves faster and higher-quality convergence than state-of-the-art optimizers.
- Score: 4.443170466488981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rate adaptation is a popular topic in machine learning. Gradient
descent trains neural networks with a fixed learning rate; learning rate
adaptation was proposed to accelerate training by adjusting the step size
during the training session. Well-known examples include Momentum, Adam and
Hypergradient. Hypergradient is the most distinctive of these: it achieves
adaptation by computing the derivative of the cost function with respect to the
learning rate and applying gradient descent to the learning rate itself.
However, Hypergradient is still not perfect. In practice, Hypergradient fails
to decrease the training loss after learning rate adaptation with large
probability. Moreover, evidence has been found that Hypergradient is not well
suited to large datasets trained with minibatches. Worst of all, Hypergradient
consistently fails to reach good accuracy on the validation dataset even though
it can reduce the training loss to a very small value. To solve Hypergradient's
problems, we propose a novel adaptation algorithm in which the learning rate is
parameter-specific and internally structured. We conduct extensive experiments
on multiple network models and datasets, comparing against various benchmark
optimizers. The results show that our algorithm achieves faster and
higher-quality convergence than these state-of-the-art optimizers.
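For intuition, the Hypergradient mechanism described above can be sketched in a few lines: the learning rate is itself updated by gradient descent, using the derivative of the loss with respect to the learning rate, which for plain SGD reduces to the inner product of two consecutive gradients. The following is a minimal NumPy sketch on a toy quadratic objective, not the paper's code; the function names, hyperparameters and objective are illustrative assumptions.

```python
import numpy as np

# Hypergradient descent for SGD (minimal sketch): the learning rate alpha is
# itself trained by gradient descent. For an SGD step
# theta_t = theta_{t-1} - alpha * grad_{t-1}, the hypergradient dL/dalpha
# equals -grad_t . grad_{t-1}, so descending on alpha adds
# beta * (grad_t . grad_{t-1}).

def loss_and_grad(theta):
    """Toy quadratic objective L(theta) = 0.5 * ||theta||^2 (illustrative only)."""
    return 0.5 * float(np.dot(theta, theta)), theta.copy()

def sgd_hd(theta0, alpha=0.01, beta=1e-4, steps=200):
    theta = np.asarray(theta0, dtype=float).copy()
    prev_grad = np.zeros_like(theta)
    for _ in range(steps):
        _, grad = loss_and_grad(theta)
        alpha += beta * float(np.dot(grad, prev_grad))  # hypergradient step on alpha
        theta -= alpha * grad                           # ordinary SGD step with adapted alpha
        prev_grad = grad
    return theta, alpha

theta, alpha = sgd_hd(np.ones(5))
print("theta:", theta, "adapted alpha:", alpha)
```

A per-parameter variant, closer in spirit to the parameter-specific learning rates proposed in this paper, would replace the scalar alpha with a vector and the dot product with an elementwise product; the internal structure the paper adds on top is not reproduced in this sketch.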
Related papers
- Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms [80.37846867546517]
We show how to train neural networks with custom objectives defined by differentiable algorithms.
We exploit the second-order information of these objectives via their empirical Fisher and Hessian matrices.
Applying Newton Losses to eight differentiable algorithms yields significant improvements for the less-optimized ones.
arXiv Detail & Related papers (2024-10-24T18:02:11Z) - Evolutionary Retrofitting [42.21143557577615]
AfterLearnER consists of applying non-differentiable optimization, including evolutionary methods, to fully-trained machine learning models.
The efficiency of AfterLearnER is demonstrated by tackling non-differentiable signals such as threshold-based criteria in depth sensing, the word error rate in speech re-synthesis, and image quality in 3D generative adversarial networks (GANs).
The advantages of AfterLearnER are its versatility (no gradient is needed), the possibility of using non-differentiable feedback including human evaluations, its limited overfitting (supported by a theoretical study), and its anytime behavior.
arXiv Detail & Related papers (2024-10-15T06:59:32Z) - Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, including its gradient updates and regularizers, which limits flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Training Acceleration of Low-Rank Decomposed Networks using Sequential Freezing and Rank Quantization [5.914653351242832]
We propose two techniques for accelerating low-rank decomposed models without requiring the use of small ranks for decomposition.
These methods include rank optimization and sequential freezing of layers.
Experiments show that these techniques can improve model throughput by up to 60% during training and 37% during inference when combined.
arXiv Detail & Related papers (2023-09-07T16:33:42Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Locally Regularized Neural Differential Equations: Some Black Boxes Were Meant to Remain Closed! [3.222802562733787]
Implicit layer deep learning techniques, like Neural Differential Equations, have become an important modeling framework.
We develop two sampling strategies to trade off between performance and training time.
Our method reduces the number of function evaluations to 0.556-0.733x and accelerates predictions by 1.3-2x.
arXiv Detail & Related papers (2023-03-03T23:31:15Z) - Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction [29.61757744974324]
Deep neural networks incur significant memory and computation costs.
Sparse training is one of the most common techniques to reduce these costs.
In this work, we aim to overcome the drawbacks of existing sparse training methods and achieve space-time co-efficiency.
arXiv Detail & Related papers (2023-01-09T18:50:03Z) - Scaling Forward Gradient With Local Losses [117.22685584919756]
Forward learning is a biologically plausible alternative to backprop for learning deep neural networks.
We show that it is possible to substantially reduce the variance of the forward gradient by applying perturbations to activations rather than weights (a sketch of the baseline weight-perturbed estimator appears after this list).
Our approach matches backprop on MNIST and CIFAR-10 and significantly outperforms previously proposed backprop-free algorithms on ImageNet.
arXiv Detail & Related papers (2022-10-07T03:52:27Z) - Tom: Leveraging trend of the observed gradients for faster convergence [0.0]
Tom is a novel variant of Adam that takes into account the trend observed for the gradients in the loss landscape traversed by the neural network.
Tom outperforms Adagrad, Adadelta, RMSProp and Adam in terms of accuracy and also converges faster.
arXiv Detail & Related papers (2021-09-07T20:19:40Z) - Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
AdaMomentum performs comparably or better on vision tasks, and consistently achieves state-of-the-art results on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z)
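As a companion to the "Scaling Forward Gradient With Local Losses" entry above, here is a minimal sketch of the baseline weight-perturbed forward gradient estimator that the activation-perturbation idea improves upon. The toy loss and function names are assumptions, and JAX is used only as a convenient forward-mode differentiation library; this is not code from the paper.

```python
import jax
import jax.numpy as jnp

# Weight-perturbed forward gradient (baseline sketch): sample a random tangent v,
# compute the directional derivative with a single forward-mode jvp (no backward
# pass), and use (grad . v) * v as an unbiased estimate of the true gradient.

def loss(w):
    return 0.5 * jnp.sum(w ** 2)  # illustrative objective only

def forward_gradient(loss_fn, w, key):
    v = jax.random.normal(key, w.shape)           # random perturbation direction for the weights
    _, dir_deriv = jax.jvp(loss_fn, (w,), (v,))   # directional derivative, forward mode only
    return dir_deriv * v                          # unbiased, but high-variance, gradient estimate

key = jax.random.PRNGKey(0)
w = jnp.ones(4)
print(forward_gradient(loss, w, key))
```

Averaging this estimator over several random tangents reduces its variance; the paper's point is that perturbing activations (together with local losses) attacks the same variance problem much more effectively than perturbing weights.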
This list is automatically generated from the titles and abstracts of the papers in this site.