The Limit of the Batch Size
- URL: http://arxiv.org/abs/2006.08517v1
- Date: Mon, 15 Jun 2020 16:18:05 GMT
- Title: The Limit of the Batch Size
- Authors: Yang You, Yuhui Wang, Huan Zhang, Zhao Zhang, James Demmel, and Cho-Jui Hsieh
- Abstract summary: Large-batch training is an efficient approach for current distributed deep learning systems.
In this paper, we focus on studying the limit of the batch size.
We provide detailed numerical optimization instructions for step-by-step comparison.
- Score: 79.8857712299211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-batch training is an efficient approach for current distributed deep
learning systems. It has enabled researchers to reduce the ImageNet/ResNet-50
training from 29 hours to around 1 minute. In this paper, we focus on studying
the limit of the batch size. We think it may provide guidance to AI
supercomputer and algorithm designers. We provide detailed numerical
optimization instructions for step-by-step comparison. Moreover, it is
important to understand the generalization and optimization performance of huge
batch training. Hoffer et al. introduced the "ultra-slow diffusion" theory for
large-batch training. However, our experiments contradict the conclusions of
Hoffer et al. We provide comprehensive experimental results
and detailed analysis to study the limitations of batch size scaling and
"ultra-slow diffusion" theory. For the first time we scale the batch size on
ImageNet to at least a magnitude larger than all previous work, and provide
detailed studies on the performance of many state-of-the-art optimization
schemes under this setting. We propose an optimization recipe that is able to
improve the top-1 test accuracy by 18% compared to the baseline.
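The abstract does not spell out the optimization recipe, but a standard building block in this line of large-batch work (including the first author's earlier LARS optimizer) is layer-wise adaptive rate scaling, which rescales the learning rate per layer by a trust ratio. A minimal numpy sketch of that idea follows; the function name, constants, and the omission of momentum are illustrative simplifications, not the paper's exact recipe.

```python
import numpy as np

def lars_style_step(w, grad, base_lr=8.0, weight_decay=1e-4, trust_coef=0.001):
    """One simplified LARS-style update for a single layer's weight tensor.

    The trust ratio keeps the update magnitude proportional to the weight
    magnitude, which is what lets the global learning rate be pushed up
    with very large batch sizes. Momentum is omitted for brevity.
    """
    g = grad + weight_decay * w                        # L2-regularized gradient
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(g)
    trust_ratio = trust_coef * w_norm / (g_norm + 1e-12) if w_norm > 0 else 1.0
    return w - base_lr * trust_ratio * g               # layer-local effective LR

# Usage: apply independently to each layer's weights.
w = 0.01 * np.random.randn(256, 128)
g = np.random.randn(256, 128)
w = lars_style_step(w, g)
```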
Related papers
- Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam-style optimizers.
We prove that the optimal learning rate first rises and then falls as the batch size increases (a short note on this scaling follows the entry).
arXiv Detail & Related papers (2024-05-23T13:52:36Z)
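For context only: the classic gradient-noise analysis of plain SGD (not this paper's setting) predicts an optimal learning rate that rises with batch size $B$ and then saturates, roughly

$$\eta_{\mathrm{opt}}(B) \;\approx\; \frac{\eta_{\max}}{1 + \mathcal{B}_{\mathrm{noise}}/B},$$

where $\mathcal{B}_{\mathrm{noise}}$ is a gradient-noise scale (the notation here is mine). The entry above shows that for Adam-style optimizers the optimum is instead non-monotonic, rising and then falling as $B$ grows; the precise scaling law is given in that paper and is not reproduced here.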
- Accelerating Large Batch Training via Gradient Signal to Noise Ratio (GSNR) [16.351871316985598]
We develop a variance-reduced gradient descent technique (VRGD) based on the gradient signal-to-noise ratio (GSNR).
VRGD can accelerate training ($1\sim 2\times$), narrow the generalization gap, and improve final accuracy.
We improve ImageNet Top-1 accuracy at a batch size of 96k by 0.52 percentage points over LARS (a sketch of the GSNR computation follows this entry).
arXiv Detail & Related papers (2023-09-24T16:08:21Z)
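The gradient signal-to-noise ratio referenced in the entry above has a simple per-parameter form: the squared mean of the per-example gradient divided by its variance. A small numpy sketch, with illustrative shapes and a helper name of my own (not the paper's code):

```python
import numpy as np

def gsnr(per_example_grads, eps=1e-12):
    """Per-parameter gradient signal-to-noise ratio.

    per_example_grads has shape (num_examples, num_params), one gradient
    row per training example. GSNR_j = E[g_j]^2 / Var[g_j]: a large value
    means the examples agree on the update direction for parameter j.
    """
    mean = per_example_grads.mean(axis=0)
    var = per_example_grads.var(axis=0)
    return mean ** 2 / (var + eps)

# Toy usage: a shared signal of 0.3 buried in unit-variance noise.
g = 0.3 + np.random.randn(64, 1000)
print(gsnr(g).mean())   # close to 0.3**2 / 1.0 = 0.09 on average
```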
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
We show that scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time.
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by using stale parameters (a sketch of this idea follows the entry).
This is the first work to successfully scale the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z)
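ConAdv's key move, as summarized above, is to craft the adversarial perturbation from parameters of a previous step so that the two gradient passes of adversarial training no longer have to run back to back. A single-process PyTorch sketch of that idea (the model, sizes, and FGSM-style perturbation are illustrative; the real method overlaps the two passes across devices):

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(128, 32)
y = torch.randint(0, 10, (128,))

stale = copy.deepcopy(model).requires_grad_(False)  # previous-step parameters

# Pass 1: craft the perturbation against the stale copy. Because it never
# touches the live parameters, it could run concurrently with pass 2.
x_adv = x.clone().requires_grad_(True)
loss_fn(stale(x_adv), y).backward()
x_adv = (x + 0.03 * x_adv.grad.sign()).detach()

# Pass 2: ordinary update of the current model on the perturbed batch.
opt.zero_grad()
loss_fn(model(x_adv), y).backward()
opt.step()
```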
- Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness (the update form in question is sketched after this entry).
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z)
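In notation of mine (not the paper's), the update studied above is large-batch SGD applied with a staleness of $\tau_t$ steps:

$$w_{t+1} \;=\; w_t \;-\; \frac{\eta}{B} \sum_{i \in \mathcal{B}_t} \nabla f_i\!\left(w_{t-\tau_t}\right),$$

so both the batch size $B$ and the staleness $\tau_t$ enter the convergence analysis, which is exactly the dependence the entry highlights.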
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- Adaptive Learning of the Optimal Batch Size of SGD [52.50880550357175]
We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling scheme suitable for distributed implementations.
arXiv Detail & Related papers (2020-05-03T14:28:32Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.