AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on
the Fly
- URL: http://arxiv.org/abs/2105.10762v1
- Date: Sat, 22 May 2021 16:41:10 GMT
- Title: AutoLRS: Automatic Learning-Rate Schedule by Bayesian Optimization on
the Fly
- Authors: Yuchen Jin, Tianyi Zhou, Liangyu Zhao, Yibo Zhu, Chuanxiong Guo, Marco
Canini, Arvind Krishnamurthy
- Abstract summary: We propose AutoLRS, which automatically optimizes the learning rate for each training stage by modeling training dynamics.
We demonstrate the advantages and the generality of AutoLRS through extensive experiments on training tasks from diverse domains.
- Score: 22.754424957856052
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The learning rate (LR) schedule is one of the most important hyper-parameters
needing careful tuning in training DNNs. However, it is also one of the least
automated parts of machine learning systems and usually costs significant
manual effort and computing. Though there are pre-defined LR schedules and
optimizers with adaptive LR, they introduce new hyperparameters that need to be
tuned separately for different tasks/datasets. In this paper, we consider the
question: Can we automatically tune the LR over the course of training without
human involvement? We propose an efficient method, AutoLRS, which automatically
optimizes the LR for each training stage by modeling training dynamics. AutoLRS
aims to find an LR applied to every $\tau$ steps that minimizes the resulting
validation loss. We solve this black-box optimization on the fly by Bayesian
optimization (BO). However, collecting training instances for BO requires a
system to evaluate each LR queried by BO's acquisition function for $\tau$
steps, which is prohibitively expensive in practice. Instead, we apply each
candidate LR for only $\tau'\ll\tau$ steps and train an exponential model to
predict the validation loss after $\tau$ steps. This mutual-training process
between BO and the loss-prediction model allows us to limit the training steps
invested in the BO search. We demonstrate the advantages and the generality of
AutoLRS through extensive experiments of training DNNs for tasks from diverse
domains using different optimizers. The LR schedules auto-generated by AutoLRS
lead to a speedup of $1.22\times$, $1.43\times$, and $1.5\times$ when training
ResNet-50, Transformer, and BERT, respectively, compared to the LR schedules in
their original papers, and an average speedup of $1.31\times$ over
state-of-the-art heavily-tuned LR schedules.
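To make the mechanism concrete, here is a minimal sketch (not the authors' released code) of the idea the abstract describes: each LR proposed by BO is applied for only $\tau' \ll \tau$ steps, an exponential model fitted to the short validation-loss curve forecasts the loss after $\tau$ steps, and that forecast is fed back to BO as the observation. The `short_run` hook, the GP-with-expected-improvement choice of BO, and all hyperparameter values below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the AutoLRS idea, assuming a hypothetical
# short_run(lr, tau_prime) hook that restores the stage checkpoint,
# trains for tau_prime steps at the given LR, and returns the
# validation losses recorded along the way.
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def forecast_loss(losses, tau):
    """Fit l(t) = a*exp(-b*t) + c to a short loss curve and
    extrapolate to step tau (the exponential forecasting model)."""
    t = np.arange(1, len(losses) + 1, dtype=float)
    f = lambda t, a, b, c: a * np.exp(-b * t) + c
    (a, b, c), _ = curve_fit(
        f, t, np.asarray(losses, dtype=float),
        p0=(losses[0] - losses[-1], 1e-2, losses[-1]), maxfev=10_000)
    return f(float(tau), a, b, c)


def expected_improvement(gp, cand, best_y):
    """EI acquisition for minimization."""
    mu, sigma = gp.predict(cand, return_std=True)
    z = (best_y - mu) / np.maximum(sigma, 1e-9)
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)


def search_stage_lr(short_run, tau, tau_prime,
                    lr_bounds=(1e-5, 1e-1), n_queries=10, n_seed=3):
    """One training stage: BO over log10(LR); each observation is the
    forecasted validation loss at step tau from a tau_prime-step trial."""
    lo, hi = np.log10(lr_bounds[0]), np.log10(lr_bounds[1])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    X, y = [], []
    for i in range(n_queries):
        if i < n_seed:  # seed the GP with a few random LRs
            log_lr = np.random.uniform(lo, hi)
        else:           # maximize EI on a 1-D candidate grid
            cand = np.linspace(lo, hi, 256).reshape(-1, 1)
            idx = int(np.argmax(expected_improvement(gp, cand, min(y))))
            log_lr = float(cand[idx, 0])
        losses = short_run(10.0 ** log_lr, tau_prime)
        X.append([log_lr])
        y.append(forecast_loss(losses, tau))
        gp.fit(np.array(X), np.array(y))
    return 10.0 ** X[int(np.argmin(y))][0]  # LR to apply for tau steps
```

The expensive step this avoids is running every candidate for the full $\tau$ steps; the trade-off is that the exponential fit must be a reasonable local model of the loss curve, which is why the paper interleaves ("mutually trains") the BO search and the loss-prediction model.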
Related papers
- Where Do Large Learning Rates Lead Us? [5.305784285588872]
We show that only a narrow range of initial LRs leads to optimal results after fine-tuning with a small LR or weight averaging.
We show that these initial LRs result in a sparse set of learned features, with a clear focus on those most relevant for the task.
In contrast, starting training with too small LRs leads to unstable minima and attempts to learn all features simultaneously, resulting in poor generalization.
arXiv Detail & Related papers (2024-10-29T15:14:37Z)
- Scaling Optimal LR Across Token Horizons [81.29631219839311]
We show how the optimal learning rate depends on the token horizon in LLM training.
We also provide evidence that LLaMA-1 used too high an LR, and estimate the performance hit from this.
arXiv Detail & Related papers (2024-09-30T03:32:02Z)
- Scaling Law with Learning Rate Annealing [4.121865876406014]
Cross-entropy loss curves of neural language models adhere to a scaling law with learning rate (LR) annealing over training steps.
Applying the scaling law with LR annealing and fitting only one or two training curves, we can accurately predict the loss at any given step under any learning rate schedule (LRS).
arXiv Detail & Related papers (2024-08-20T17:30:48Z)
- ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search [50.45155830888697]
ReST-MCTS* integrates process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces.
We first show that the tree-search policy in ReST-MCTS* achieves higher accuracy compared with prior LLM reasoning baselines.
We then show that by using traces searched by this tree-search policy as training data, we can continuously enhance the three language models for multiple iterations.
arXiv Detail & Related papers (2024-06-06T07:40:00Z)
- M-L2O: Towards Generalizable Learning-to-Optimize by Test-Time Fast Self-Adaptation [145.7321032755538]
Learning to Optimize (L2O) has drawn increasing attention as it often remarkably accelerates the optimization procedure of complex tasks.
This paper investigates a potential solution to this open challenge by meta-training an L2O that can perform fast test-time self-adaptation to an out-of-distribution task.
arXiv Detail & Related papers (2023-02-28T19:23:20Z)
- Selecting and Composing Learning Rate Policies for Deep Neural Networks [10.926538783768219]
This paper presents a systematic approach to selecting and composing an LR policy for effective Deep Neural Networks (DNNs) training.
First, we develop an LR tuning mechanism for auto-verification of a given LR policy with respect to the desired accuracy goal under a pre-defined training time constraint.
Second, we develop an LR policy recommendation system (LRBench) to select and compose good LR policies from the same and/or different LR functions through dynamic tuning.
Third, we extend LRBench by supporting different DNNs and show the significant mutual impact of different LR policies and different DNN models.
arXiv Detail & Related papers (2022-10-24T03:32:59Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present Online Convolutional Re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with state-of-the-art re-parameterization models, OREPA saves about 70% of the training-time memory cost and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve performance comparable to that of small-batch training.
We propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
- A Wasserstein Minimax Framework for Mixed Linear Regression [69.40394595795544]
Multi-modal distributions are commonly used to model clustered data in learning tasks.
We propose an optimal transport-based framework for Mixed Linear Regression problems.
arXiv Detail & Related papers (2021-06-14T16:03:51Z)
- MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyperparameters in stochastic gradient descent (SGD) training of deep neural networks (DNNs).
In this paper, we propose MLR-SNet to learn a proper LR schedule that transfers across heterogeneous tasks.
We also transfer the learned MLR-SNet to query tasks that differ from the training ones in noise, architecture, data modality, and size, and it achieves comparable or even better performance.
arXiv Detail & Related papers (2020-07-29T01:18:58Z)