Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum
- URL: http://arxiv.org/abs/2501.08883v1
- Date: Wed, 15 Jan 2025 15:53:27 GMT
- Title: Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum
- Authors: Keisuke Kamo, Hideaki Iiduka
- Abstract summary: Stochastic gradient descent with momentum (SGDM) has been well studied in both theory and practice.
We focus on mini-batch SGDM with constant learning rate and constant momentum weight.
- Score: 0.6906005491572401
- License:
- Abstract: Stochastic gradient descent with momentum (SGDM), which is defined by adding a momentum term to SGD, has been well studied in both theory and practice. Theoretical results show that the convergence of SGDM depends on the settings of the learning rate and momentum weight. Meanwhile, practical results show that the performance of SGDM strongly depends on the batch size. In this paper, we focus on mini-batch SGDM with a constant learning rate and a constant momentum weight, which is frequently used to train deep neural networks in practice. The contribution of this paper is to show theoretically that using a constant batch size does not always minimize the expectation of the full gradient norm of the empirical loss when training a deep neural network, whereas using an increasing batch size definitely minimizes it; that is, an increasing batch size improves the convergence of mini-batch SGDM. We also provide numerical results supporting our analyses, indicating specifically that mini-batch SGDM with an increasing batch size converges to stationary points faster than with a constant batch size. Python implementations of the optimizers used in the numerical experiments are available at https://anonymous.4open.science/r/momentum-increasing-batch-size-888C/.
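As an illustration of the scheme analyzed in the abstract, the following sketch runs mini-batch SGDM with a constant learning rate and constant momentum weight while the batch size grows each epoch. The function name, the geometric growth schedule, and the toy least-squares objective are illustrative assumptions, not the paper's implementation (which is linked above).

```python
import numpy as np

def sgdm_increasing_batch(grad_fn, w0, data, lr=0.1, beta=0.9,
                          b0=8, growth=1.3, epochs=10, seed=0):
    """Mini-batch SGDM whose batch size grows each epoch.

    grad_fn(w, batch) returns the mini-batch gradient.  The schedule
    b_k = b0 * growth**k is one illustrative increasing schedule.
    """
    rng = np.random.default_rng(seed)
    w, m = float(w0), 0.0
    n = len(data)
    for epoch in range(epochs):
        b = min(n, int(b0 * growth ** epoch))  # increasing batch size
        idx = rng.permutation(n)
        for start in range(0, n, b):
            batch = data[idx[start:start + b]]
            g = grad_fn(w, batch)
            m = beta * m + g       # heavy-ball momentum buffer
            w = w - lr * m         # constant learning rate update
    return w

# Toy objective: f(w) = E[(w - x)^2] / 2, so the mini-batch gradient
# is w - mean(batch) and the minimizer is the data mean (3.0 here).
data = np.random.default_rng(1).normal(loc=3.0, size=256)
w_hat = sgdm_increasing_batch(lambda w, batch: w - batch.mean(), 0.0, data)
```

With the growing batch size, the stochastic gradient noise shrinks over training, so the iterate settles near the minimizer rather than hovering in a noise ball of fixed radius.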
Related papers
- Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [0.6906005491572401]
Using an increasing batch size leads to faster convergence of Riemannian stochastic gradient descent (RSGD) than using a constant batch size.
Experiments on principal component analysis and low-rank matrix problems confirmed that using an arithmetically growing batch size or an exponentially growing batch size results in better performance than using a constant batch size.
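The constant, growing, and exponentially growing schedules compared above can be sketched as a small helper; the parameter names and defaults below are illustrative choices, not the paper's settings.

```python
def batch_size_schedule(epoch, b0=16, mode="exponential",
                        rate=2.0, delta=16, b_max=1024):
    """Return the batch size for a given epoch under three schedules:
    'constant'    : b0 for every epoch
    'growth'      : arithmetic growth b0 + delta * epoch
    'exponential' : geometric growth b0 * rate ** epoch
    All schedules are capped at b_max."""
    if mode == "constant":
        b = b0
    elif mode == "growth":
        b = b0 + delta * epoch
    elif mode == "exponential":
        b = int(b0 * rate ** epoch)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return min(b, b_max)
```

The cap `b_max` reflects the practical memory limit that ends up terminating any growing schedule.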
arXiv Detail & Related papers (2025-01-30T06:23:28Z)
- When and Why Momentum Accelerates SGD: An Empirical Study [76.2666927020119]
This study examines the performance of stochastic gradient descent (SGD) with momentum (SGDM).
We find that the momentum acceleration is closely related to abrupt sharpening, a sudden jump of the directional Hessian along the update direction.
Momentum improves performance by preventing or deferring the occurrence of abrupt sharpening.
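The directional Hessian along the update direction, the quantity behind "abrupt sharpening", can be monitored with a standard central-difference estimate; the helper below is a generic sketch, not the paper's estimator.

```python
import numpy as np

def directional_hessian(grad_fn, w, d, h=1e-4):
    """Central-difference estimate of u^T H(w) u, where u = d / ||d||
    is the unit update direction and H is the Hessian of the loss.
    grad_fn(w) must return the gradient at w."""
    u = d / np.linalg.norm(d)
    g_plus = grad_fn(w + h * u)
    g_minus = grad_fn(w - h * u)
    return u @ (g_plus - g_minus) / (2 * h)

# On the quadratic f(w) = w^T diag(2, 6) w / 2 the directional Hessian
# along each coordinate axis is the corresponding diagonal entry.
grad = lambda w: np.array([2.0, 6.0]) * w
curv_x = directional_hessian(grad, np.zeros(2), np.array([1.0, 0.0]))
curv_y = directional_hessian(grad, np.zeros(2), np.array([0.0, 1.0]))
```

Tracking this scalar per step (with `d` set to the optimizer's update direction) is one way to detect a sudden jump in sharpness during training.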
arXiv Detail & Related papers (2023-06-15T09:54:21Z)
- Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of Deep Learning Optimizer Using Hyperparameters Close to One [0.0]
We show that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions.
Results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing stochastic first-order oracle (SFO) complexity converges faster than Momentum and stochastic gradient descent.
arXiv Detail & Related papers (2022-08-21T06:11:23Z)
- Low-Precision Stochastic Gradient Langevin Dynamics [70.69923368584588]
We provide the first study of low-precision Stochastic Gradient Langevin Dynamics (SGLD), showing that its costs can be significantly reduced without sacrificing performance.
We develop a new quantization function for SGLD that preserves the variance in each update step.
We demonstrate that low-precision SGLD achieves comparable performance to full-precision SGLD with only 8 bits on a variety of deep learning tasks.
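A common way to keep a low-precision update unbiased is stochastic rounding, sketched below. This is a generic primitive, not the paper's variance-preserving quantizer, and the grid spacing `step` is an illustrative parameter.

```python
import numpy as np

def stochastic_round(x, step=2.0 ** -4, rng=None):
    """Round x to a fixed-point grid with spacing `step`, rounding up
    with probability proportional to the remainder, so that
    E[stochastic_round(x)] = x (the quantizer adds no bias)."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(x, dtype=float) / step
    low = np.floor(scaled)
    p_up = scaled - low                 # probability of rounding up
    up = rng.random(scaled.shape) < p_up
    return (low + up) * step

# Averaging many quantizations of the same value recovers it.
rng = np.random.default_rng(0)
vals = stochastic_round(np.full(100_000, 0.3), step=1.0, rng=rng)
```

Unbiasedness matters for SGLD in particular because the sampler's stationary distribution is sensitive to systematic bias in the update, while zero-mean quantization noise behaves like extra injected noise.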
arXiv Detail & Related papers (2022-06-20T17:25:41Z)
- Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum weights such that multiple epochs of standard SGDM, as a special form of SGD with early momentum (SGDEM), also generalizes.
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
- Contrastive Weight Regularization for Large Minibatch SGD [8.927483136015283]
We introduce a novel regularization technique, namely distinctive regularization (DReg).
DReg replicates a certain layer of the deep network and encourages the parameters of both layers to be diverse.
We empirically show that optimizing the neural network with DReg using large-batch SGD achieves a significant boost in convergence and improved performance.
arXiv Detail & Related papers (2020-11-17T22:07:38Z)
- Double Forward Propagation for Memorized Batch Normalization [68.34268180871416]
Batch Normalization (BN) has been a standard component in designing deep neural networks (DNNs).
We propose a memorized batch normalization (MBN) which considers multiple recent batches to obtain more accurate and robust statistics.
Compared to related methods, the proposed MBN exhibits consistent behaviors in both training and inference.
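The memorization idea can be sketched as a normalization layer that averages the mean and variance of the last k mini-batches. The choice of k, the uniform weighting, and the absence of learnable affine parameters are simplifying assumptions here, not the paper's exact formulation.

```python
import numpy as np
from collections import deque

class MemorizedBatchNorm:
    """Normalize each mini-batch with statistics averaged over the
    most recent `k` batches instead of the current batch alone."""

    def __init__(self, k=4, eps=1e-5):
        self.means = deque(maxlen=k)   # memorized per-feature means
        self.vars = deque(maxlen=k)    # memorized per-feature variances
        self.eps = eps

    def __call__(self, x):
        # record this batch's statistics, then normalize with the
        # average over all memorized batches
        self.means.append(x.mean(axis=0))
        self.vars.append(x.var(axis=0))
        mu = np.mean(self.means, axis=0)
        var = np.mean(self.vars, axis=0)
        return (x - mu) / np.sqrt(var + self.eps)

bn = MemorizedBatchNorm(k=4)
x = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(64, 3))
y = bn(x)   # first batch: normalized with its own statistics
```

Because the memorized statistics change smoothly as batches arrive, the same averaged statistics can be used at inference, which is one way to interpret the consistent train/inference behavior noted above.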
arXiv Detail & Related papers (2020-10-10T08:48:41Z)
- Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training [9.964630991617764]
Stochastic gradient descent (SGD) and its variants have been the dominating optimization methods in machine learning.
In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum (SNGM), for large-batch training.
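The normalization idea can be sketched as a single update step in which the momentum buffer averages unit-norm gradients, making the step length insensitive to the gradient's scale. This is a sketch of the idea only; the exact recursion and constants in the paper may differ.

```python
import numpy as np

def sngm_step(w, m, grad, lr=0.1, beta=0.9, eps=1e-12):
    """One sketched SNGM step: normalize the stochastic gradient, fold
    it into an exponential moving average, and take a bounded-length step."""
    u = grad / (np.linalg.norm(grad) + eps)
    m = beta * m + (1.0 - beta) * u   # average of unit gradients, ||m|| <= 1
    w = w - lr * m                    # so the update length is at most lr
    return w, m

# On f(w) = ||w||^2 / 2 (gradient w), the iterates approach the origin
# at a pace set by lr, regardless of how large the initial gradient is.
w, m = np.array([10.0, 0.0]), np.zeros(2)
for _ in range(200):
    w, m = sngm_step(w, m, w)   # the gradient of f at w is w itself
```

Bounding the step length this way is what makes large-batch training robust to the large, low-noise gradients that big batches produce early in training.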
arXiv Detail & Related papers (2020-07-28T04:34:43Z)
- On the Generalization Benefit of Noise in Stochastic Gradient Descent [34.127525925676416]
It has long been argued that mini-batch stochastic gradient descent can generalize better than large-batch gradient descent in deep neural networks.
We show that small or moderately large batch sizes can substantially outperform very large batches on the test set.
arXiv Detail & Related papers (2020-06-26T16:18:54Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.