Where Do Large Learning Rates Lead Us?
- URL: http://arxiv.org/abs/2410.22113v1
- Date: Tue, 29 Oct 2024 15:14:37 GMT
- Title: Where Do Large Learning Rates Lead Us?
- Authors: Ildus Sadrtdinov, Maxim Kodryan, Eduard Pokonechny, Ekaterina Lobacheva, Dmitry Vetrov
- Abstract summary: We show that only a narrow range of initial LRs leads to optimal results after fine-tuning with a small LR or weight averaging.
We show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
In contrast, starting training with too small LRs leads to unstable minima and an attempt to learn all features simultaneously, resulting in poor generalization.
- Score: 5.305784285588872
- License:
- Abstract: It is generally accepted that starting neural network training with large learning rates (LRs) improves generalization. Following a line of research devoted to understanding this effect, we conduct an empirical study in a controlled setting focusing on two questions: 1) how large an initial LR is required for obtaining optimal quality, and 2) what are the key differences between models trained with different LRs? We discover that only a narrow range of initial LRs slightly above the convergence threshold leads to optimal results after fine-tuning with a small LR or weight averaging. By studying the local geometry of reached minima, we observe that using LRs from this optimal range allows the optimization to locate a basin that only contains high-quality minima. Additionally, we show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task. In contrast, starting training with too small LRs leads to unstable minima and an attempt to learn all features simultaneously, resulting in poor generalization. Conversely, using initial LRs that are too large fails to detect a basin with good solutions and to extract meaningful patterns from the data.
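For concreteness, here is a minimal sketch of the two-stage protocol examined in the abstract: pretrain with a large constant LR, then either fine-tune with a small LR or average weights (SWA) along the large-LR trajectory. This is not the authors' code; the model, data, LR values, and epoch counts are illustrative assumptions.

```python
# Sketch of the protocol studied in the paper (all values are placeholders).
import copy
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, update_bn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
data = [(torch.randn(16, 32), torch.randint(0, 10, (16,))) for _ in range(64)]
loss_fn = nn.CrossEntropyLoss()

def train(model, lr, epochs):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: large initial LR (assumed value; in the paper it is tuned per setup,
# slightly above the convergence threshold).
train(model, lr=0.1, epochs=20)

# Option (a): fine-tune a copy with a small LR.
finetuned = copy.deepcopy(model)
train(finetuned, lr=0.01, epochs=5)

# Option (b): keep the large LR and average the weights visited along the way (SWA).
swa_model = AveragedModel(model)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(5):
    for x, y in data:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    swa_model.update_parameters(model)
update_bn(data, swa_model)  # no-op here (no BatchNorm layers), shown for completeness
```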
Related papers
- ClearSR: Latent Low-Resolution Image Embeddings Help Diffusion-Based Real-World Super Resolution Models See Clearer [68.72454974431749]
We present ClearSR, a new method that can better take advantage of latent low-resolution image (LR) embeddings for diffusion-based real-world image super-resolution (Real-ISR).
Our model can achieve better performance across multiple metrics on several test sets and generate more consistent SR results with LR images than existing methods.
arXiv Detail & Related papers (2024-10-18T08:35:57Z)
- Boosting Deep Ensembles with Learning Rate Tuning [1.6021932740447968]
Learning Rate (LR) has a high impact on deep learning training performance.
This paper presents a novel framework, LREnsemble, to leverage effective learning rate tuning to boost deep ensemble performance.
arXiv Detail & Related papers (2024-10-10T02:59:38Z)
- Scaling Optimal LR Across Token Horizons [81.29631219839311]
We show how optimal learning rate depends on token horizon in LLM training.
We also provide evidence that LLaMA-1 used too high an LR, and estimate the performance hit from this.
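As an illustration of extrapolating an optimal LR to a longer token horizon, the sketch below fits a power law in log-log space to hypothetical (token horizon, tuned LR) pairs. The power-law form and every number here are assumptions for illustration, not values from the paper.

```python
# Extrapolate an optimal LR to a longer token horizon via a log-log linear fit.
import numpy as np

tokens = np.array([1e9, 2e9, 4e9, 8e9])                 # short-horizon runs (assumed)
optimal_lr = np.array([3e-3, 2.4e-3, 1.9e-3, 1.5e-3])   # tuned LRs at those horizons (assumed)

# Linear fit in log-log space: log(lr) ~ slope * log(tokens) + intercept.
slope, intercept = np.polyfit(np.log(tokens), np.log(optimal_lr), 1)

target_tokens = 1e12
predicted_lr = np.exp(intercept) * target_tokens ** slope
print(f"predicted optimal LR at {target_tokens:.0e} tokens: {predicted_lr:.2e}")
```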
arXiv Detail & Related papers (2024-09-30T03:32:02Z)
- Large Learning Rates Improve Generalization: But How Large Are We Talking About? [6.218417024312705]
Recent research recommends starting neural network training with large learning rates (LRs) to achieve the best generalization.
Our study clarifies the initial LR ranges that provide optimal results for subsequent training with a small LR or weight averaging.
arXiv Detail & Related papers (2023-11-19T11:36:35Z)
- Selecting and Composing Learning Rate Policies for Deep Neural Networks [10.926538783768219]
This paper presents a systematic approach to selecting and composing an LR policy for effective Deep Neural Networks (DNNs) training.
We develop an LR tuning mechanism for auto-verification of a given LR policy with respect to the desired accuracy goal under the pre-defined training time constraint.
Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR functions through dynamic tuning.
Third, we extend LRBench by supporting different DNNs and show the significant mutual impact of different LR policies and different DNNs.
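A toy sketch of the auto-verification idea described in this summary (not the LRBench API): run a candidate LR policy and check whether it reaches an accuracy goal within a time budget. The model, data, policies, goal, and budget are illustrative assumptions.

```python
# Verify candidate LR policies against an accuracy goal under a time budget.
import time
import torch
import torch.nn as nn

def make_data(n=512):
    x = torch.randn(n, 16)
    y = (x.sum(dim=1) > 0).long()   # simple separable labels for illustration
    return x, y

def accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

def verify_policy(lr_policy, accuracy_goal=0.9, time_budget_s=5.0, epochs=200):
    """Return (reached_goal, best_accuracy) for one LR policy (epoch -> LR)."""
    x, y = make_data()
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr_policy(0))
    loss_fn = nn.CrossEntropyLoss()
    best, start = 0.0, time.time()
    for epoch in range(epochs):
        for g in opt.param_groups:          # apply the policy's LR for this epoch
            g["lr"] = lr_policy(epoch)
        opt.zero_grad()
        loss_fn(model(x), y).backward()     # full-batch step, for brevity
        opt.step()
        best = max(best, accuracy(model, x, y))
        if best >= accuracy_goal or time.time() - start > time_budget_s:
            break
    return best >= accuracy_goal, best

# Compare a fixed LR against a simple step-decay policy.
for name, policy in [("fixed 0.1", lambda e: 0.1),
                     ("step decay", lambda e: 0.1 * (0.5 ** (e // 10)))]:
    ok, acc = verify_policy(policy)
    print(f"{name}: goal reached={ok}, best accuracy={acc:.2f}")
```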
arXiv Detail & Related papers (2022-10-24T03:32:59Z)
- A Wasserstein Minimax Framework for Mixed Linear Regression [69.40394595795544]
Multi-modal distributions are commonly used to model clustered data in learning tasks.
We propose an optimal transport-based framework for Mixed Linear Regression problems.
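For reference, a standard formulation of the mixed linear regression problem named above (context only, not this paper's Wasserstein minimax objective): each response is generated by one of K unobserved linear components.

```latex
% Standard mixed linear regression model with K components
% (problem setting only; not the paper's minimax objective).
y_i = \langle x_i, \beta_{z_i} \rangle + \varepsilon_i,
\qquad z_i \in \{1, \dots, K\} \ \text{(latent component label)},
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
```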
arXiv Detail & Related papers (2021-06-14T16:03:51Z)
- MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyperparameters in stochastic gradient descent (SGD) training of deep neural networks (DNNs).
In this paper, we propose MLR-SNet to learn a proper LR schedule for DNN training.
We also transfer MLR-SNet to query tasks with different noises, architectures, data modalities, and sizes from the training ones, and achieve comparable or even better performance.
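As a toy illustration of the general idea (not the released MLR-SNet), the sketch below uses a small recurrent network that maps the current training loss to a learning rate in place of a hand-crafted schedule. Meta-training of this schedule network, the core of the paper, is omitted, and all sizes and scales are assumptions.

```python
# A learned, loss-driven LR schedule (toy version; meta-training omitted).
import torch
import torch.nn as nn

class LRSchedulerNet(nn.Module):
    def __init__(self, hidden=16, max_lr=0.1):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden)   # reads the scalar training loss
        self.head = nn.Linear(hidden, 1)     # emits a value in (0, 1)
        self.max_lr = max_lr
        self.state = None

    def forward(self, loss_value):
        x = torch.tensor([[loss_value]])
        self.state = self.lstm(x, self.state)
        h, _ = self.state
        return self.max_lr * torch.sigmoid(self.head(h)).item()

# Usage: predict an LR from the running loss and write it into the optimizer.
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched_net = LRSchedulerNet()
for step in range(100):
    x, y = torch.randn(32, 8), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        new_lr = sched_net(loss.item())
    for g in opt.param_groups:
        g["lr"] = new_lr
```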
arXiv Detail & Related papers (2020-07-29T01:18:58Z)
- Closed-loop Matters: Dual Regression Networks for Single Image Super-Resolution [73.86924594746884]
Deep neural networks have exhibited promising performance in image super-resolution.
These networks learn a nonlinear mapping function from low-resolution (LR) images to high-resolution (HR) images.
We propose a dual regression scheme by introducing an additional constraint on LR data to reduce the space of the possible functions.
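A minimal sketch of a dual-regression-style objective consistent with this summary (not the authors' DRN implementation): a primal network maps LR to HR, a dual network maps the predicted HR back to LR, and a cycle term constrains the admissible mappings. Network sizes, the 2x scale, and the loss weight are assumptions.

```python
# Dual regression loss: primal SR objective plus an extra constraint on LR data.
import torch
import torch.nn as nn
import torch.nn.functional as F

scale = 2
primal = nn.Sequential(  # LR -> HR (toy upsampler)
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
    nn.Conv2d(16, 3, 3, padding=1),
)
dual = nn.Sequential(    # HR -> LR (toy downsampler)
    nn.Conv2d(3, 16, 3, stride=scale, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)

lr_img = torch.rand(4, 3, 16, 16)   # low-resolution batch
hr_img = torch.rand(4, 3, 32, 32)   # paired high-resolution batch

sr = primal(lr_img)
primal_loss = F.l1_loss(sr, hr_img)        # standard SR objective
dual_loss = F.l1_loss(dual(sr), lr_img)    # additional constraint on LR data
loss = primal_loss + 0.1 * dual_loss       # 0.1 is an assumed weight
loss.backward()
```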
arXiv Detail & Related papers (2020-03-16T04:23:42Z)
- PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models [77.32079593577821]
PULSE (Photo Upsampling via Latent Space Exploration) generates high-resolution, realistic images at resolutions previously unseen in the literature.
Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
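A toy sketch of the latent-space-exploration idea named in the title (not the PULSE implementation): search a fixed generator's latent space for an HR image whose downscaled version matches the given LR input. The random toy generator, image sizes, and step counts are assumptions.

```python
# Optimize a latent code so that the generated HR image downscales to the LR input.
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Sequential(   # z (64-d) -> 3x32x32 "HR" image (toy stand-in for a GAN)
    nn.Linear(64, 3 * 32 * 32), nn.Tanh(), nn.Unflatten(1, (3, 32, 32))
)
for p in generator.parameters():
    p.requires_grad_(False)  # the generator stays fixed; only the latent code is optimized

lr_img = torch.rand(1, 3, 8, 8)            # the low-resolution input
z = torch.randn(1, 64, requires_grad=True)
opt = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    hr_guess = generator(z)
    downscaled = F.interpolate(hr_guess, size=(8, 8), mode="bilinear",
                               align_corners=False)
    loss = F.mse_loss(downscaled, lr_img)  # downscaling-consistency loss
    opt.zero_grad()
    loss.backward()
    opt.step()

upsampled = generator(z).detach()          # final high-resolution estimate
```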
arXiv Detail & Related papers (2020-03-08T16:44:31Z)