Stagewise Enlargement of Batch Size for SGD-based Learning
- URL: http://arxiv.org/abs/2002.11601v2
- Date: Thu, 27 Feb 2020 03:13:52 GMT
- Title: Stagewise Enlargement of Batch Size for SGD-based Learning
- Authors: Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li
- Abstract summary: Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning.
We propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD.
- Score: 20.212176652894495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing research shows that the batch size can seriously affect the
performance of stochastic gradient descent (SGD) based learning, including
training speed and generalization ability. A larger batch size typically
results in fewer parameter updates, and in distributed training it also
results in less frequent communication. However, a larger batch size can more
easily lead to a generalization gap. Hence, how to set a proper batch size for
SGD has recently attracted much attention. Although some methods for setting
the batch size have been proposed, the problem has still not been well solved.
In this paper, we first provide theory showing that a proper batch size is
related to the gap between the initialization and the optimum of the model
parameter. Based on this theory, we then propose a novel method, called
stagewise enlargement of batch size (SEBS), to set a proper batch size for
SGD. More specifically, SEBS adopts a multi-stage scheme and enlarges the
batch size geometrically by stage. We theoretically prove that, compared to
classical stagewise SGD, which decreases the learning rate by stage, SEBS can
reduce the number of parameter updates without increasing the generalization
error. SEBS is suitable for SGD, momentum SGD, and AdaGrad. Empirical results
on real data verify the theories of SEBS and also show that SEBS can
outperform other baselines.
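The abstract describes SEBS only at the level of its schedule: training proceeds in stages, and instead of dividing the learning rate at each stage boundary, the batch size is multiplied by a constant factor. Below is a minimal sketch of such a stagewise batch-size schedule on a synthetic least-squares problem; the enlargement factor, stage length, learning rate, and initial batch size are illustrative assumptions, not values prescribed or derived in the paper.

```python
# Hedged sketch of a stagewise batch-size schedule for SGD (SEBS-style).
# All schedule constants below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: minimize (1/2n) * ||X w - y||^2.
n, d = 4096, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_size):
    """Unbiased stochastic gradient estimated from a sampled mini-batch."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(d)
lr = 0.05                  # kept constant across stages
batch_size = 16            # initial batch size (assumed)
kappa = 2                  # geometric enlargement factor per stage (assumed)
updates_per_stage = 200    # stage length in parameter updates (assumed)

for stage in range(5):
    for _ in range(updates_per_stage):
        w -= lr * minibatch_grad(w, batch_size)
    loss = 0.5 * np.mean((X @ w - y) ** 2)
    print(f"stage {stage}: batch_size={batch_size}, loss={loss:.4f}")
    # Enlarge the batch size instead of decaying the learning rate.
    batch_size = min(n, batch_size * kappa)
```

In a classical stagewise schedule the last line would instead divide the learning rate by kappa with the batch size held fixed; the abstract's claim is that enlarging the batch size instead can reduce the number of parameter updates without increasing generalization error.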
Related papers
- Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful [71.96579951744897]
Conventional wisdom dictates that small batch sizes make language model pretraining and fine-tuning unstable, motivating gradient accumulation. In this work, we revisit small batch sizes all the way down to batch size one, and we propose a rule for scaling Adam hyperparameters to small batch sizes (see the sketch after this list).
arXiv Detail & Related papers (2025-07-09T17:57:36Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum [0.6906005491572401]
Stochastic gradient descent with momentum (SGDM) has been well studied in both theory and practice.
We focus on mini-batch SGDM with constant learning rate and constant momentum weight.
arXiv Detail & Related papers (2025-01-15T15:53:27Z) - ARB-LLM: Alternating Refined Binarizations for Large Language Models [82.24826360906341]
ARB-LLM is a novel 1-bit post-training quantization (PTQ) technique tailored for Large Language Models (LLMs)
As a binary PTQ method, our ARB-LLM$_\text{RC}$ is the first to surpass FP16 models of the same size.
arXiv Detail & Related papers (2024-10-04T03:50:10Z) - Sparse is Enough in Fine-tuning Pre-trained Large Language Models [98.46493578509039]
We propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT)
We validate its effectiveness on a range of tasks including the GLUE Benchmark and Instruction-tuning.
arXiv Detail & Related papers (2023-12-19T06:06:30Z) - Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks [93.00280593719513]
We study high-dimensional multi-armed contextual bandits with batched feedback where the $T$ steps of online interactions are divided into $L$ batches.
Specifically, each batch collects data according to a policy that depends on previous batches, and the rewards are revealed only at the end of the batch.
Our algorithm achieves regret bounds comparable to those in the fully sequential setting with only $\mathcal{O}(\log T)$ batches.
arXiv Detail & Related papers (2023-11-22T06:06:54Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Contrastive Weight Regularization for Large Minibatch SGD [8.927483136015283]
We introduce a novel regularization technique, namely distinctive regularization (DReg)
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in convergence and improved performance.
arXiv Detail & Related papers (2020-11-17T22:07:38Z) - Double Forward Propagation for Memorized Batch Normalization [68.34268180871416]
Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs)
We propose a memorized batch normalization (MBN) which considers multiple recent batches to obtain more accurate and robust statistics.
Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference.
arXiv Detail & Related papers (2020-10-10T08:48:41Z) - Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
arXiv Detail & Related papers (2020-07-28T04:34:43Z) - On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that minibatch stochastic gradient descent can generalize better than large batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely-adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z) - Scaling Distributed Training with Adaptive Summation [2.6210166639679]
This paper introduces a novel method to combine gradients called Adasum (for adaptive sum) that converges faster than prior work.
Adasum is easy to implement, almost as efficient as simply summing gradients, and is integrated into the open-source toolkit Horovod.
arXiv Detail & Related papers (2020-06-04T15:08:20Z) - Extended Batch Normalization [3.377000738091241]
Batch normalization (BN) has become a standard technique for training modern deep networks.
In this paper, we propose a simple but effective method, called extended batch normalization (EBN)
Experiments show that extended batch normalization alleviates the problem of batch normalization with a small batch size while achieving performance close to that of batch normalization with a large batch size.
arXiv Detail & Related papers (2020-03-12T01:53:15Z)
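The first entry in the list above contrasts plain small-batch updates with gradient accumulation. As a point of reference for that comparison, here is a generic, hedged illustration of the two update patterns in PyTorch; the model, data, micro-batch size, and accumulation factor are arbitrary placeholders, not settings taken from that paper.

```python
# Generic illustration: gradient accumulation vs. plain small-batch SGD.
# Everything here (model, data, k, lr) is an arbitrary placeholder.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

def micro_batch():
    x = torch.randn(4, 10)                # micro-batch of 4 examples
    return x, x.sum(dim=1, keepdim=True)  # toy regression target

k = 8  # accumulation factor (assumed)

# (a) Gradient accumulation: k forward/backward passes per optimizer step,
#     emulating an effective batch of k * 4 examples with a single update.
opt.zero_grad()
for _ in range(k):
    x, y = micro_batch()
    (loss_fn(model(x), y) / k).backward()  # scale so gradients average over k
opt.step()

# (b) Plain small-batch SGD: one update per micro-batch, so the same amount
#     of forward/backward compute yields k parameter updates instead of one.
for _ in range(k):
    x, y = micro_batch()
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```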
This list is automatically generated from the titles and abstracts of the papers on this site.