Training Aware Sigmoidal Optimizer
- URL: http://arxiv.org/abs/2102.08716v1
- Date: Wed, 17 Feb 2021 12:00:46 GMT
- Title: Training Aware Sigmoidal Optimizer
- Authors: David Macêdo, Pedro Dreyer, Teresa Ludermir, Cleber Zanchettin
- Abstract summary: Deep neural network loss functions present landscapes with many more saddle points than local minima.
We propose the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase automated learning rate schedule.
We compared the proposed approach with commonly used adaptive learning rate schedules such as Adam, RMSProp, and Adagrad.
- Score: 2.99368851209995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Proper optimization of deep neural networks is an open research question
since an optimal procedure to change the learning rate throughout training is
still unknown. Manually defining a learning rate schedule involves troublesome,
time-consuming trial-and-error procedures to determine hyperparameters such as
learning rate decay epochs and learning rate decay rates. Although adaptive
learning rate optimizers automate this process, recent studies suggest they
may produce overfitting and reduce performance when compared to fine-tuned
learning rate schedules. Considering that deep neural network loss functions
present landscapes with many more saddle points than local minima, we proposed
the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase
automated learning rate schedule. The first phase uses a high learning rate to
quickly traverse the numerous saddle points, while the second phase uses a low
learning rate to slowly approach the center of the local minimum previously
found. We compared the proposed approach with commonly used adaptive learning
rate schedules such as Adam, RMSProp, and Adagrad. Our experiments showed that
TASO outperformed all competing methods in both optimal (i.e., performing
hyperparameter validation) and suboptimal (i.e., using default hyperparameters)
scenarios.
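The two-phase behavior described in the abstract can be illustrated with a short sketch. The sigmoid midpoint, steepness, and the high/low learning-rate values below are illustrative assumptions, not the parameterization used in the paper.

```python
import math

def taso_like_lr(epoch, total_epochs, lr_high=0.1, lr_low=0.001, steepness=10.0):
    """Sigmoid-shaped two-phase schedule (illustrative; not the paper's exact formula).

    Early epochs: the rate stays near lr_high to traverse saddle points quickly.
    Late epochs: the rate decays toward lr_low to settle into the minimum found.
    """
    midpoint = 0.5 * total_epochs                      # assumed switch point between phases
    t = steepness * (epoch - midpoint) / total_epochs  # normalized position in the schedule
    return lr_low + (lr_high - lr_low) / (1.0 + math.exp(t))

# Example: the schedule over a 100-epoch run.
for epoch in (0, 25, 50, 75, 99):
    print(epoch, round(taso_like_lr(epoch, 100), 5))
```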
Related papers
- Understanding Optimization in Deep Learning with Central Flows [53.66160508990508]
We show that RMSProp's implicit behavior can be explicitly captured by a "central flow": a differential equation.
We show that these flows can empirically predict long-term optimization trajectories of generic neural networks.
arXiv Detail & Related papers (2024-10-31T17:58:13Z)
- Unlearning as multi-task optimization: A normalized gradient difference approach with an adaptive learning rate [105.86576388991713]
We introduce a normalized gradient difference (NGDiff) algorithm, enabling us to have better control over the trade-off between the objectives.
We provide a theoretical analysis and empirically demonstrate the superior performance of NGDiff among state-of-the-art unlearning methods on the TOFU and MUSE datasets.
arXiv Detail & Related papers (2024-10-29T14:41:44Z)
- Learning Rate Optimization for Deep Neural Networks Using Lipschitz Bandits [9.361762652324968]
A properly tuned learning rate leads to faster training and higher test accuracy.
We propose a Lipschitz bandit-driven approach for tuning the learning rate of neural networks.
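As a rough illustration of bandit-driven learning rate tuning, the sketch below runs a standard UCB1 bandit over a discrete grid of candidate rates; the paper itself uses Lipschitz bandits over a continuous range, so this is a simplified stand-in, and train_and_eval is a hypothetical user-supplied callback.

```python
import math

def ucb_lr_search(candidate_lrs, train_and_eval, budget=30):
    """UCB1 over a discrete grid of learning rates (simplified stand-in; the paper
    uses Lipschitz bandits over a continuous range).

    train_and_eval : callable lr -> reward, e.g. validation accuracy of a short run.
    """
    counts = [0] * len(candidate_lrs)
    totals = [0.0] * len(candidate_lrs)
    for step in range(1, budget + 1):
        # Play each arm once, then pick the arm with the highest upper confidence bound.
        ucb = [float("inf") if counts[i] == 0
               else totals[i] / counts[i] + math.sqrt(2.0 * math.log(step) / counts[i])
               for i in range(len(candidate_lrs))]
        i = max(range(len(candidate_lrs)), key=ucb.__getitem__)
        totals[i] += train_and_eval(candidate_lrs[i])
        counts[i] += 1
    best = max(range(len(candidate_lrs)), key=lambda k: totals[k] / max(counts[k], 1))
    return candidate_lrs[best]
```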
arXiv Detail & Related papers (2024-09-15T16:21:55Z)
- Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple reparameterization which we call Normalize-and-Project.
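A minimal sketch of the weight-projection idea, assuming the projection simply rescales each weight matrix back to a recorded norm after every optimizer step; the paper's exact Normalize-and-Project procedure may differ.

```python
import torch

def project_weights(model, target_norms):
    """Rescale each weight matrix back to a fixed norm after an optimizer step.

    For scale-invariant (normalized) layers, this keeps the effective learning rate
    governed by the explicit schedule instead of by silent growth of the weight norm.
    """
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in target_norms:
                p.mul_(target_norms[name] / (p.norm() + 1e-12))

# Usage sketch: record the initial norms once, then project after every step.
#   target_norms = {n: p.norm().item() for n, p in model.named_parameters() if p.dim() > 1}
#   loss.backward(); optimizer.step(); project_weights(model, target_norms)
```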
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
- Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses [5.052293146674794]
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as Adam, fail to converge if the learning rates do not converge to zero.
In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates.
arXiv Detail & Related papers (2024-06-20T14:07:39Z)
- Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima [40.70374106466073]
We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP).
LEAP can be applied to various learning rate schedules to improve model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
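LEAP's plugin idea, injecting a small perturbation into whatever base schedule is already in use, can be sketched as a wrapper; the Gaussian noise and its scale below are illustrative assumptions rather than the paper's exact perturbation.

```python
import random

def perturbed_lr(base_lr_fn, epoch, noise_scale=0.1, rng=random):
    """Wrap an arbitrary base schedule and perturb its output (illustrative of LEAP).

    base_lr_fn  : callable epoch -> learning rate (step decay, cosine, ...)
    noise_scale : relative size of the multiplicative perturbation (assumed value)
    rng         : any object with a gauss(mu, sigma) method; defaults to the random module
    """
    return base_lr_fn(epoch) * (1.0 + rng.gauss(0.0, noise_scale))

# Example with a simple step-decay base schedule.
step_decay = lambda e: 0.1 * (0.1 ** (e // 30))
lrs = [perturbed_lr(step_decay, e) for e in range(90)]
```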
arXiv Detail & Related papers (2022-08-25T05:05:18Z)
- Meta-Learning with Adaptive Hyperparameters [55.182841228303225]
We focus on a complementary factor in the MAML framework: inner-loop optimization (or fast adaptation).
We propose a new weight update rule that greatly enhances the fast adaptation process.
arXiv Detail & Related papers (2020-10-31T08:05:34Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction in which a parameter changed in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
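The alignment rule described above can be sketched as follows; the exponential average of past updates and the scaling of the per-element rate are assumptions for illustration, not AdaRem's published update equations.

```python
import torch

def adarem_like_step(param, change_buf, base_lr=0.01, beta=0.9, strength=0.5):
    """One illustrative parameter update in the spirit of AdaRem (not the exact rule).

    change_buf tracks the direction of past parameter changes; the per-element
    learning rate is raised where the current gradient agrees with that history
    and lowered where it disagrees.
    """
    with torch.no_grad():
        g = param.grad
        agreement = torch.sign(change_buf) * torch.sign(g)    # +1 aligned, -1 opposed
        lr = base_lr * (1.0 + strength * agreement)           # parameter-wise learning rate
        update = -lr * g
        param.add_(update)
        change_buf.mul_(beta).add_(update, alpha=1.0 - beta)  # remember past change direction
```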
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation [7.386152866234369]
In deep learning tasks, the learning rate determines the update step size in each iteration.
We propose a novel optimization method based on local quadratic approximation (LQA).
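A generic version of the local-quadratic idea is sketched below: probe the loss at a few trial step sizes along the chosen update direction, fit a quadratic, and step to its minimizer. The probe spacing and the fallback are assumptions; the paper's LQA procedure may differ in detail.

```python
def lqa_like_step_size(loss_at, eps=1e-2):
    """Pick a step size from a local quadratic fit along the update direction.

    loss_at : callable t -> loss after a trial step of size t in the chosen direction
              (the caller evaluates the model and reverts the trial step).
    Fits L(t) ~ a*t**2 + b*t + c through t in {0, eps, 2*eps} and returns its minimizer.
    """
    l0, l1, l2 = loss_at(0.0), loss_at(eps), loss_at(2.0 * eps)
    a = (l2 - 2.0 * l1 + l0) / (2.0 * eps ** 2)  # curvature from the second difference
    b = (l1 - l0) / eps - a * eps                # slope consistent with the three points
    if a <= 0.0:                                 # no usable curvature: fall back to a small step
        return eps
    return max(-b / (2.0 * a), 0.0)              # never step backwards along the direction
```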
arXiv Detail & Related papers (2020-04-07T10:55:12Z)
- Statistical Adaptive Stochastic Gradient Methods [34.859895010071234]
We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in gradient methods.
SALSA first uses a smoothed line-search procedure to gradually increase the learning rate, then automatically decreases the learning rate.
The method for decreasing the learning rate is based on a new statistical test for detecting stationarity when using a constant step size.
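The decrease rule relies on a statistical stationarity test; the toy stand-in below simply cuts the rate when a recent window of losses stops improving. The window length, threshold, and decay factor are assumptions, and this is neither SALSA's actual test nor its line-search warm-up.

```python
from collections import deque
from statistics import mean

class StationaryDropLR:
    """Toy stand-in for a statistical learning-rate-decrease trigger (not SALSA's test).

    Keeps a window of recent losses; when the second half of the window is no longer
    clearly lower than the first half, training is treated as stationary and the
    learning rate is cut by a fixed factor.
    """
    def __init__(self, lr=0.1, window=200, decay=0.1, tol=1e-3):
        self.lr, self.decay, self.tol = lr, decay, tol
        self.losses = deque(maxlen=window)

    def update(self, loss):
        self.losses.append(loss)
        if len(self.losses) == self.losses.maxlen:
            half = self.losses.maxlen // 2
            recent = list(self.losses)
            if mean(recent[:half]) - mean(recent[half:]) < self.tol:  # no clear progress
                self.lr *= self.decay
                self.losses.clear()
        return self.lr
```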
arXiv Detail & Related papers (2020-02-25T00:04:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.