Automated Learning Rate Scheduler for Large-batch Training
- URL: http://arxiv.org/abs/2107.05855v1
- Date: Tue, 13 Jul 2021 05:23:13 GMT
- Title: Automated Learning Rate Scheduler for Large-batch Training
- Authors: Chiheon Kim, Saehoon Kim, Jongmin Kim, Donghoon Lee, Sungwoong Kim
- Abstract summary: Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to reach a level of performance comparable to that of small-batch training.
We propose an automated LR scheduling algorithm that is effective for neural network training with a large batch size under a given epoch budget.
- Score: 24.20872850681828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-batch training has been essential in leveraging large-scale datasets
and models in deep learning. While it is computationally beneficial to use
large batch sizes, it often requires a specially designed learning rate (LR)
schedule to achieve a level of performance comparable to that of smaller-batch
training. In particular, when the number of training epochs is constrained, the
use of a large LR and a warmup strategy is critical to the final performance of
large-batch training due to the reduced number of update steps. In this work,
we propose an automated LR scheduling algorithm which is effective for neural
network training with a large batch size under a given epoch budget.
Specifically, the whole schedule consists of two phases, adaptive warmup and
predefined decay: the LR is increased until the training loss no longer
decreases and is then decayed to zero by the end of training. Here, whether the
training loss has reached the minimum value is robustly checked with Gaussian
process smoothing in an online manner with a low computational burden. Coupled
with adaptive stochastic optimizers such as AdamP and LAMB, the proposed
scheduler successfully adjusts the LRs without cumbersome hyperparameter tuning
and achieves comparable or better performance than tuned baselines on various
image classification benchmarks and architectures with a wide range of batch
sizes.
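The two-phase schedule described above can be sketched in a few lines of PyTorch-style Python. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a simple moving-average comparison of the training loss as a cheap stand-in for the paper's online Gaussian-process smoothing, a multiplicative warmup growth factor, and a linear decay to zero; the class name and default constants are hypothetical.

```python
from collections import deque


class TwoPhaseLRScheduler:
    """Adaptive warmup followed by predefined decay (illustrative sketch)."""

    def __init__(self, optimizer, base_lr, total_steps, growth=1.05, window=50):
        self.opt = optimizer          # any object exposing .param_groups (PyTorch-style)
        self.lr = base_lr
        self.total_steps = total_steps
        self.growth = growth          # multiplicative LR increase per warmup step
        self.window = window
        self.losses = deque(maxlen=2 * window)
        self.warmup_done = False
        self.decay_start = None
        self.peak_lr = base_lr
        self.step_idx = 0

    def _loss_still_decreasing(self):
        # Stand-in for the paper's GP smoothing: compare the mean loss over the
        # two most recent windows and require a strict decrease.
        if len(self.losses) < 2 * self.window:
            return True
        old = sum(list(self.losses)[: self.window]) / self.window
        new = sum(list(self.losses)[self.window :]) / self.window
        return new < old

    def step(self, train_loss):
        self.losses.append(float(train_loss))
        self.step_idx += 1
        if not self.warmup_done:
            if self._loss_still_decreasing():
                self.lr *= self.growth       # adaptive warmup: keep raising the LR
            else:
                self.warmup_done = True      # loss plateaued: freeze the peak LR
                self.decay_start = self.step_idx
                self.peak_lr = self.lr
        if self.warmup_done:
            # predefined decay: anneal the LR linearly to zero by the last step
            remaining = max(self.total_steps - self.step_idx, 0)
            span = max(self.total_steps - self.decay_start, 1)
            self.lr = self.peak_lr * remaining / span
        for group in self.opt.param_groups:
            group["lr"] = self.lr
```

In the paper this kind of schedule is paired with adaptive optimizers such as AdamP or LAMB; the linear decay above is just one possible choice of "predefined decay".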
Related papers
- Optimization Hyper-parameter Laws for Large Language Models [56.322914260197734]
We present Opt-Laws, a framework that captures the relationship between hyper-parameters and training outcomes.
Our validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss.
This approach significantly reduces computational costs while enhancing overall model performance.
arXiv Detail & Related papers (2024-09-07T09:37:19Z)
- Iteration and Stochastic First-order Oracle Complexities of Stochastic Gradient Descent using Constant and Decaying Learning Rates [0.8158530638728501]
We show that the performance of stochastic gradient descent (SGD) depends not only on the learning rate but also on the batch size.
We show that measured critical batch sizes are close to the sizes estimated from our theoretical results.
arXiv Detail & Related papers (2024-02-23T14:24:45Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
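The "naive" LR adjustment referred to here is commonly the linear scaling rule, which multiplies the LR by the same factor as the mini-batch size. A minimal sketch under that assumption, for a PyTorch-style optimizer (the helper name and example numbers are illustrative, not taken from the paper):

```python
def scale_lr_with_batch(optimizer, base_lr, base_batch, new_batch):
    """Linear scaling rule: multiply the LR by the batch-size ratio."""
    scaled_lr = base_lr * (new_batch / base_batch)
    for group in optimizer.param_groups:
        group["lr"] = scaled_lr
    return scaled_lr


# Example: moving from batch 256 at LR 3e-4 to batch 2048 scales the LR to 2.4e-3.
```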
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Hyper-Learning for Gradient-Based Batch Size Adaptation [2.944323057176686]
Scheduling the batch size to increase over training is an effective strategy for controlling gradient noise when training deep neural networks.
We introduce Arbiter, a new hyper-optimization algorithm that performs batch-size adaptation with learnable schedules.
We demonstrate Arbiter's effectiveness in several illustrative experiments.
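As a concrete, hand-written example of the kind of batch-size scheduling this entry refers to, the stepwise rule below doubles the batch size over the course of training to damp gradient noise. Arbiter learns such a schedule via hyper-optimization instead of fixing it, so this is only an illustration and the function is hypothetical:

```python
def batch_size_at(step, total_steps, min_bs=128, max_bs=2048):
    """Stepwise schedule that doubles the batch size across training."""
    # Count how many doublings fit between the minimum and maximum batch size.
    n_stages, bs = 0, min_bs
    while bs < max_bs:
        bs *= 2
        n_stages += 1
    # Split training into (n_stages + 1) equal phases, one per batch size.
    stage = min(n_stages, step * (n_stages + 1) // total_steps)
    return min(min_bs * (2 ** stage), max_bs)
```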
arXiv Detail & Related papers (2022-05-17T11:01:14Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
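The basic mechanics behind this kind of re-parameterization can be seen in the standard trick of folding a BatchNorm layer into the preceding convolution, so that the pair collapses into a single conv with adjusted weights and bias. The PyTorch sketch below shows only that textbook fusion, not OREPA's online squeezing of its multi-branch training-time blocks:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm2d into the preceding Conv2d, returning a single conv."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    # Per output channel: scale = gamma / sqrt(running_var + eps).
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```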
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyperparameters in stochastic gradient descent (SGD) training of deep neural networks (DNNs).
In this paper, we propose MLR-SNet to learn a proper LR schedule.
We also transfer MLR-SNet to query tasks that differ from the training tasks in noise, architecture, data modality, and size, and it achieves comparable or even better performance.
arXiv Detail & Related papers (2020-07-29T01:18:58Z)
- AdaScale SGD: A User-Friendly Algorithm for Distributed Training [29.430153773234363]
We propose AdaScale SGD, an algorithm that reliably adapts learning rates to large-batch training.
By continually adapting to the gradient's variance, AdaScale achieves speed-ups for a wide range of batch sizes.
This includes large-batch training with no model degradation for machine translation, image classification, object detection, and speech recognition tasks.
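A rough sketch of a gradient-variance-driven LR gain in the spirit of what this entry describes: with S mini-batch gradients averaged per step, the gain compares the mean squared norm of a single-batch gradient to the squared norm of their average, staying near 1 when the gradients are highly correlated and approaching S when they are nearly independent. This is an illustrative estimator, not necessarily AdaScale's exact one:

```python
import torch


def variance_gain(per_batch_grads):
    """Variance-based LR gain from S per-mini-batch gradients (flattened 1-D tensors)."""
    S = len(per_batch_grads)
    stacked = torch.stack(per_batch_grads)             # shape (S, num_params)
    avg = stacked.mean(dim=0)
    sq_norm_single = stacked.pow(2).sum(dim=1).mean()  # estimates E[||g_i||^2]
    sq_norm_avg = avg.pow(2).sum()                     # ||(1/S) * sum_i g_i||^2
    return (sq_norm_single / sq_norm_avg).clamp(1.0, float(S))


# The step's LR would then be base_lr * variance_gain(grads): close to the
# single-batch LR when the gradients are redundant, close to base_lr * S otherwise.
```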
arXiv Detail & Related papers (2020-07-09T23:26:13Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
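CLARS builds on layer-wise adaptive rate scaling (LARS), in which each layer's effective LR is set by the ratio of its weight norm to its gradient norm. The PyTorch sketch below shows only that base LARS-style update as a point of reference; it is not the complete CLARS algorithm:

```python
import torch


@torch.no_grad()
def layerwise_sgd_step(params, base_lr, weight_decay=1e-4, eps=1e-8):
    """One SGD step with a LARS-style layer-wise trust ratio (illustrative)."""
    for w in params:
        if w.grad is None:
            continue
        g = w.grad + weight_decay * w           # gradient with weight decay added
        trust = w.norm() / (g.norm() + eps)     # ||w|| / ||g||, the trust ratio
        w -= base_lr * trust * g                # layer-wise scaled update
```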
arXiv Detail & Related papers (2020-02-04T23:03:12Z)