Cyclical Log Annealing as a Learning Rate Scheduler
- URL: http://arxiv.org/abs/2403.14685v1
- Date: Wed, 13 Mar 2024 14:07:20 GMT
- Title: Cyclical Log Annealing as a Learning Rate Scheduler
- Authors: Philip Naveen,
- Abstract summary: A learning rate scheduler is a set of instructions for varying search stepsizes during model training processes.
This paper introduces a new logarithmic method using harsh restarting of step sizes through descent gradient.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A learning rate scheduler is a predefined set of instructions for varying search stepsizes during model training processes. This paper introduces a new logarithmic method using harsh restarting of step sizes through stochastic gradient descent. Cyclical log annealing implements the restart pattern more aggressively to maybe allow the usage of more greedy algorithms on the online convex optimization framework. The algorithm was tested on the CIFAR-10 image datasets, and seemed to perform analogously with cosine annealing on large transformer-enhanced residual neural networks. Future experiments would involve testing the scheduler in generative adversarial networks and finding the best parameters for the scheduler with more experiments.
Related papers
- Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batch and further apply it to more complex neural networks.
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Composite FORCE learning of chaotic echo state networks for time-series
prediction [7.650966670809372]
This paper proposes a composite FORCE learning method to train ESNs whose initial activity is spontaneously chaotic.
numerical results have shown that it significantly improves learning and prediction performances compared with existing methods.
arXiv Detail & Related papers (2022-07-06T03:44:09Z) - Optimizing Convergence for Iterative Learning of ARIMA for Stationary
Time Series [1.9444242128493845]
This work focuses on the computational less expensive Online Gradient Descent optimization method, which became popular for learning of neural networks in recent years.
We propose a new approach combining different Online Gradient Descent learners (such as Adam, AMSGrad, Adagrad, Nesterov) to achieve fast convergence.
arXiv Detail & Related papers (2021-01-25T12:07:46Z) - Short-Term Memory Optimization in Recurrent Neural Networks by
Autoencoder-based Initialization [79.42778415729475]
We explore an alternative solution based on explicit memorization using linear autoencoders for sequences.
We show how such pretraining can better support solving hard classification tasks with long sequences.
We show that the proposed approach achieves a much lower reconstruction error for long sequences and a better gradient propagation during the finetuning phase.
arXiv Detail & Related papers (2020-11-05T14:57:16Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem)
AdaRem adjusts the parameter-wise learning rate according to whether the direction of one parameter changes in the past is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - Automatic Tuning of Stochastic Gradient Descent with Bayesian
Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-/regressive formulation.
It flexibly adjusts to abrupt changes of behaviours induced by new learning rate values.
It is well-suited to tackle a set of problems: first, for the on-line adaptation of the learning rate for a cold-started run; then, for tuning the schedule for a set of similar tasks, as well as warm-starting it for a new task.
arXiv Detail & Related papers (2020-06-25T13:18:18Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of textit"knowledge gain" and textit"mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS)
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Stochastic gradient descent with random learning rate [0.0]
We propose to optimize neural networks with a uniformly-distributed random learning rate.
By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime.
We provide supporting evidence through experiments on both shallow, fully-connected and deep, convolutional neural networks for image classification on the MNIST and CIFAR10 datasets.
arXiv Detail & Related papers (2020-03-15T21:36:46Z) - Subset Sampling For Progressive Neural Network Learning [106.12874293597754]
Progressive Neural Network Learning is a class of algorithms that incrementally construct the network's topology and optimize its parameters based on the training data.
We propose to speed up this process by exploiting subsets of training data at each incremental training step.
Experimental results in object, scene and face recognition problems demonstrate that the proposed approach speeds up the optimization procedure considerably.
arXiv Detail & Related papers (2020-02-17T18:57:33Z) - Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.