Large Learning Rates Improve Generalization: But How Large Are We Talking About?
- URL: http://arxiv.org/abs/2311.11303v1
- Date: Sun, 19 Nov 2023 11:36:35 GMT
- Title: Large Learning Rates Improve Generalization: But How Large Are We Talking About?
- Authors: Ekaterina Lobacheva, Eduard Pockonechnyy, Maxim Kodryan, Dmitry Vetrov
- Abstract summary: Recent research recommends starting neural network training with large learning rates (LRs) to achieve the best generalization.
Our study clarifies the initial LR ranges that provide optimal results for subsequent training with a small LR or weight averaging.
- Score: 6.218417024312705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by recent research that recommends starting neural network training with large learning rates (LRs) to achieve the best generalization, we explore
this hypothesis in detail. Our study clarifies the initial LR ranges that
provide optimal results for subsequent training with a small LR or weight
averaging. We find that these ranges are in fact significantly narrower than
generally assumed. We conduct our main experiments in a simplified setup that
allows precise control of the learning rate hyperparameter and validate our key
findings in a more practical setting.
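To make the protocol concrete, here is a minimal sketch of the two-phase recipe the abstract refers to: a large-LR phase followed by either small-LR fine-tuning or weight averaging. The model, data, LR values, and phase lengths below are illustrative placeholders, not the paper's setup.

```python
# Minimal sketch of the two-phase protocol: train with a large LR first,
# then either fine-tune with a small LR or average weights collected
# during the large-LR phase. All hyperparameters are placeholders.
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

# Phase 1: large initial LR (the paper argues the usable range is narrow).
opt = torch.optim.SGD(model.parameters(), lr=0.5)
snapshots = []
for step in range(200):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if step % 20 == 0:  # collect weights for optional averaging
        snapshots.append(copy.deepcopy(model.state_dict()))

# Option A: fine-tune with a small LR.
opt = torch.optim.SGD(model.parameters(), lr=0.005)
for step in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# Option B (alternative to A): average the large-LR snapshots, SWA-style.
# In practice you would pick one of the two options, not run both.
avg = {k: sum(s[k] for s in snapshots) / len(snapshots) for k in snapshots[0]}
model.load_state_dict(avg)
```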
Related papers
- Where Do Large Learning Rates Lead Us? [5.305784285588872]
We show that only a narrow range of initial LRs leads to optimal results after fine-tuning with a small LR or weight averaging.
We show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
In contrast, starting training with LRs that are too small leads to unstable minima and an attempt to learn all features simultaneously, resulting in poor generalization.
arXiv Detail & Related papers (2024-10-29T15:14:37Z)
- Boosting Deep Ensembles with Learning Rate Tuning [1.6021932740447968]
The learning rate (LR) has a major impact on deep learning training performance.
This paper presents a novel framework, LREnsemble, to leverage effective learning rate tuning to boost deep ensemble performance.
arXiv Detail & Related papers (2024-10-10T02:59:38Z)
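The abstract does not spell out LREnsemble's mechanics, but the underlying idea of leveraging LR tuning for ensembles can be illustrated roughly: train members under different learning rates and average their predictions. Everything below (the LR grid, model, and data) is an assumed placeholder, not the LREnsemble algorithm itself.

```python
# Hypothetical sketch: build a deep ensemble whose members differ only
# in learning rate, then average their predicted probabilities.
import torch
import torch.nn as nn

x, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

members = []
for lr in [0.3, 0.1, 0.03, 0.01]:  # placeholder LR grid
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    members.append(model)

# Ensemble prediction: average member probabilities.
with torch.no_grad():
    probs = torch.stack([m(x).softmax(dim=-1) for m in members]).mean(dim=0)
```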
- Scaling Optimal LR Across Token Horizons [81.29631219839311]
We show how the optimal learning rate depends on the token horizon in LLM training.
We also provide evidence that LLaMA-1 used a learning rate that was too high, and we estimate the resulting performance hit.
arXiv Detail & Related papers (2024-09-30T03:32:02Z)
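A transfer rule of the kind this abstract implies is commonly expressed as a power law in the token budget. The sketch below assumes that functional form with a placeholder exponent `alpha`; the paper fits its own scaling law, which may differ.

```python
# Hypothetical power-law transfer of a tuned LR to a longer token horizon.
# The functional form and the exponent alpha are assumptions made for
# illustration; the paper derives its own fit.
def scale_lr(lr_ref: float, tokens_ref: float, tokens_new: float,
             alpha: float = 0.3) -> float:
    """Scale a reference LR (tuned at tokens_ref) to a new token budget."""
    return lr_ref * (tokens_ref / tokens_new) ** alpha

# Example: an LR tuned on a 10B-token run, transferred to a 1T-token run.
print(scale_lr(lr_ref=3e-4, tokens_ref=1e10, tokens_new=1e12))
```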
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
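Instance-reweighted distributionally robust optimization is often instantiated with a softmax over per-example losses (the KL-constrained form). The sketch below uses that generic textbook form with a placeholder temperature `tau`; it is not the paper's exact objective.

```python
# Generic instance-reweighting sketch in the DRO spirit: upweight
# examples with high loss via a softmax over per-example losses.
# tau and the toy setup are placeholders, not the paper's formulation.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="none")  # keep per-example losses
x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))
tau = 1.0

for _ in range(100):
    opt.zero_grad()
    per_example = loss_fn(model(x), y)                # shape: (128,)
    weights = torch.softmax(per_example.detach() / tau, dim=0)
    (weights * per_example).sum().backward()          # reweighted objective
    opt.step()
```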
- Back to Basics: A Simple Recipe for Improving Out-of-Domain Retrieval in Dense Encoders [63.28408887247742]
We study whether training procedures can be improved to yield better generalization capabilities in the resulting models.
We recommend a simple recipe for training dense encoders: Train on MSMARCO with parameter-efficient methods, such as LoRA, and opt for using in-batch negatives unless given well-constructed hard negatives.
arXiv Detail & Related papers (2023-11-16T10:42:58Z)
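The "in-batch negatives" part of this recipe has a standard contrastive form: each query is scored against every passage in the batch, and the other passages act as negatives. A minimal sketch, with tiny placeholder encoders standing in for a LoRA-adapted transformer:

```python
# Minimal in-batch negatives contrastive loss for a dual encoder.
# The Linear encoders are placeholders for a real (e.g. LoRA-adapted)
# model; only the loss structure is the point here.
import torch
import torch.nn as nn
import torch.nn.functional as F

query_enc = nn.Linear(32, 16)    # placeholder query encoder
passage_enc = nn.Linear(32, 16)  # placeholder passage encoder

queries = torch.randn(8, 32)     # batch of 8 aligned (query, passage) pairs
passages = torch.randn(8, 32)

q = F.normalize(query_enc(queries), dim=-1)
p = F.normalize(passage_enc(passages), dim=-1)
scores = q @ p.T / 0.05          # similarity matrix; 0.05 = temperature
# Row i's positive is passage i; all other passages act as negatives.
loss = F.cross_entropy(scores, torch.arange(8))
loss.backward()
```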
- Small batch deep reinforcement learning [31.69289254478042]
In value-based deep reinforcement learning, the batch size parameter specifies how many transitions to sample for each gradient update.
In this work we present a broad empirical study suggesting that reducing the batch size can result in a number of significant performance gains.
arXiv Detail & Related papers (2023-10-05T20:31:37Z)
- RPLKG: Robust Prompt Learning with Knowledge Graph [11.893917358053004]
We propose a new method, robust prompt learning with knowledge graph (RPLKG).
Based on the knowledge graph, we automatically design diverse, interpretable, and meaningful prompt sets.
RPLKG shows a significant performance improvement compared to zero-shot learning.
arXiv Detail & Related papers (2023-04-21T08:22:58Z)
- Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is not independent and identically distributed (non-i.i.d.), leading to inefficient meta-training.
We show that, although trained only on toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z)
- Large-Scale Deep Learning Optimizations: A Comprehensive Survey [7.901786481399378]
We aim to provide a sketch of optimizations for large-scale deep learning with regard to model accuracy and model efficiency.
We investigate the algorithms most commonly used for optimization, elaborate on the debated generalization gap that arises in large-batch training, and review state-of-the-art strategies for addressing communication overhead and reducing memory footprint.
arXiv Detail & Related papers (2021-11-01T11:53:30Z)
- Dynamics Generalization via Information Bottleneck in Deep Reinforcement Learning [90.93035276307239]
We propose an information theoretic regularization objective and an annealing-based optimization method to achieve better generalization ability in RL agents.
We demonstrate the extreme generalization benefits of our approach in different domains ranging from maze navigation to robotic tasks.
This work provides a principled way to improve generalization in RL by gradually removing information that is redundant for task-solving.
arXiv Detail & Related papers (2020-08-03T02:24:20Z)
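An information-bottleneck regularizer of the kind this abstract describes is commonly implemented in its variational form: encode the input stochastically and penalize the KL divergence to a prior, with an annealed coefficient. The sketch below is that textbook variational-IB form with placeholder shapes, not the paper's exact objective.

```python
# Generic variational information-bottleneck regularizer: encode the
# observation as a Gaussian, penalize KL to a standard normal prior,
# and weight the penalty by a coefficient beta (annealed in training).
import torch
import torch.nn as nn

obs_dim, latent_dim = 16, 8
encoder = nn.Linear(obs_dim, 2 * latent_dim)  # outputs mean and log-var

obs = torch.randn(64, obs_dim)
mu, log_var = encoder(obs).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterize

# KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
kl = 0.5 * (mu.pow(2) + log_var.exp() - log_var - 1).sum(dim=-1).mean()
beta = 1e-3                   # placeholder for an annealing schedule
task_loss = z.pow(2).mean()   # stand-in for the actual RL objective
(task_loss + beta * kl).backward()
```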
- Robust Sampling in Deep Learning [62.997667081978825]
Deep learning requires regularization mechanisms to reduce overfitting and improve generalization.
We address this problem with a new regularization method based on distributionally robust optimization.
During training, samples are selected according to their accuracy, such that the worst-performing samples contribute the most to the optimization.
arXiv Detail & Related papers (2020-06-04T09:46:52Z)
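The selection rule described above, where the worst-performing samples drive the update, resembles a CVaR-style objective that averages the loss over the hardest fraction of each batch. The sketch below uses that generic form; the fraction and the toy setup are placeholder assumptions, not the paper's exact method.

```python
# Sketch of worst-sample-focused training: each step, back-propagate
# only through the hardest fraction of the batch (a CVaR-style rule).
# The fraction 0.25 and the toy setup are illustrative placeholders.
import torch
import torch.nn as nn

model = nn.Linear(20, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss(reduction="none")
x, y = torch.randn(128, 20), torch.randint(0, 2, (128,))

for _ in range(100):
    opt.zero_grad()
    per_example = loss_fn(model(x), y)
    k = max(1, int(0.25 * per_example.numel()))   # hardest 25% of the batch
    worst = torch.topk(per_example, k).values     # highest per-example losses
    worst.mean().backward()
    opt.step()
```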