Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule
towards Flatter Local Minima
- URL: http://arxiv.org/abs/2208.11873v1
- Date: Thu, 25 Aug 2022 05:05:18 GMT
- Title: Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule
towards Flatter Local Minima
- Authors: Hengyu Liu, Qiang Fu, Lun Du, Tiancheng Zhang, Ge Yu, Shi Han and
Dongmei Zhang
- Abstract summary: We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP).
LEAP can be applied to various learning rate schedules to improve the model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
- Score: 40.70374106466073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning rate is one of the most important hyper-parameters that has a
significant influence on neural network training. Learning rate schedules are
widely used in real practice to adjust the learning rate according to
pre-defined schedules for fast convergence and good generalization. However,
existing learning rate schedules are all heuristic algorithms and lack
theoretical support. Therefore, people usually choose the learning rate
schedules through multiple ad-hoc trials, and the obtained learning rate
schedules are sub-optimal. To boost the performance of the obtained sub-optimal
learning rate schedule, we propose a generic learning rate schedule plugin,
called LEArning Rate Perturbation (LEAP), which can be applied to various
learning rate schedules to improve the model training by introducing a certain
perturbation to the learning rate. We found that, with such a simple yet
effective strategy, the training process exponentially favors flat minima rather
than sharp minima with guaranteed convergence, which leads to better
generalization ability. In addition, we conduct extensive experiments which
show that training with LEAP can improve the performance of various deep
learning models on diverse datasets using various learning rate schedules
(including constant learning rate).
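A rough picture of how such a plugin could be wired into training is sketched below. This is a minimal illustration under assumptions, not the paper's exact algorithm: the perturbation is modeled here as multiplicative zero-mean Gaussian noise, and the names `perturbed_schedule`, `noise_std`, and `min_lr` are hypothetical choices for the example.

```python
# Hypothetical sketch of a LEAP-style plugin: wrap any base learning rate
# schedule and perturb the rate it produces at every step. The multiplicative
# Gaussian noise below is an assumed perturbation form, not the paper's.
import random


def perturbed_schedule(base_schedule, noise_std=0.05, min_lr=1e-8):
    """Wrap `base_schedule(step) -> lr` so each queried rate is perturbed."""
    def schedule(step):
        lr = base_schedule(step)
        lr *= 1.0 + random.gauss(0.0, noise_std)  # zero-mean perturbation
        return max(lr, min_lr)                    # keep the rate positive
    return schedule


# Example: perturb a constant learning rate, which the abstract notes
# LEAP also improves.
constant = lambda step: 0.1
leap_like = perturbed_schedule(constant)

for step in range(3):
    print(f"step {step}: lr = {leap_like(step):.4f}")
```

In a PyTorch-style loop, the perturbed value would simply be written into the optimizer's learning rate field (e.g. `optimizer.param_groups[0]['lr']`) before each update; the perturbation distribution that actually yields the paper's flat-minima and convergence guarantees is the one defined in the paper.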
Related papers
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL).
LRRL is a meta-learning approach that selects the learning rate based on the agent's performance during training.
Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
- Normalization and effective learning rates in reinforcement learning
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple reparameterization, which we call Normalize-and-Project (see the sketch after this list).
arXiv Detail & Related papers (2024-07-01T20:58:01Z)
- Optimal Linear Decay Learning Rate Schedules and Further Refinements [46.79573408189601]
Learning rate schedules used in practice bear little resemblance to those recommended by theory.
We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules.
arXiv Detail & Related papers (2023-10-11T19:16:35Z)
- FedLALR: Client-Specific Adaptive Learning Rates Achieve Linear Speedup for Non-IID Data [54.81695390763957]
Federated learning is an emerging distributed machine learning method.
We propose a heterogeneous local variant of AMSGrad, named FedLALR, in which each client adjusts its learning rate.
We show that our client-specific auto-tuned learning rate scheduling can converge and achieve linear speedup with respect to the number of clients.
arXiv Detail & Related papers (2023-09-18T12:35:05Z)
- Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits [35.543124939636044]
We propose a frequency-aware (counter-based) variant of SGD that applies a frequency-dependent learning rate to each token and exhibits provable speed-up over SGD when the token distribution is imbalanced.
arXiv Detail & Related papers (2021-10-10T16:17:43Z)
- Training Aware Sigmoidal Optimizer [2.99368851209995]
Loss landscapes of deep neural networks present far more saddle points than local minima.
We propose the Training Aware Sigmoidal Optimizer (TASO), a two-phase automated learning rate schedule.
We compare the proposed approach with commonly used adaptive learning rate methods such as Adam, RMSProp, and Adagrad.
arXiv Detail & Related papers (2021-02-17T12:00:46Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past change direction is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- The Two Regimes of Deep Network Training [93.84309968956941]
We study the effects of different learning schedules and the appropriate way to select them.
To this end, we isolate two distinct phases, which we refer to as the "large-step regime" and the "small-step regime".
Our training algorithm can significantly simplify learning rate schedules.
arXiv Detail & Related papers (2020-02-24T17:08:24Z)
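As referenced in the Normalize-and-Project entry above, the sketch below illustrates the effective-learning-rate point under stated assumptions. For a scale-invariant parameter (one whose output passes through a normalization layer), the effective step size behaves roughly like lr / ||w||^2, so unchecked growth of ||w|| silently decays the learning rate; projecting the weights back to a fixed norm after each step keeps the explicit schedule in control. The toy objective, the `project_to_norm` helper, and the target norm are illustrative choices, not the paper's algorithm.

```python
# Illustrative sketch only (assumed setup, not the paper's Normalize-and-Project
# algorithm): for a scale-invariant parameter, gradients are orthogonal to the
# weights, so plain SGD makes ||w|| grow and the effective step size shrink.
# Rescaling the weights to a fixed norm after each step removes that implicit decay.
import torch
import torch.nn.functional as F


def project_to_norm(param: torch.Tensor, target_norm: float = 1.0) -> None:
    """Rescale `param` in place so its L2 norm equals `target_norm`."""
    with torch.no_grad():
        param.mul_(target_norm / (param.norm() + 1e-12))


# Toy scale-invariant objective: cosine similarity ignores ||w|| entirely.
w = torch.nn.Parameter(torch.randn(16))
target = torch.ones(16)
opt = torch.optim.SGD([w], lr=0.1)

for step in range(100):
    loss = -F.cosine_similarity(w, target, dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()
    project_to_norm(w)  # without this, ||w|| grows and the effective LR shrinks
```

Dropping the `project_to_norm` call lets ||w|| grow monotonically in this toy setting, which is exactly the implicit learning rate decay that the entry describes.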