AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size
- URL: http://arxiv.org/abs/2402.05264v1
- Date: Wed, 7 Feb 2024 21:19:05 GMT
- Title: AdaBatchGrad: Combining Adaptive Batch Size and Adaptive Step Size
- Authors: Petr Ostroukhov, Aigerim Zhumabayeva, Chulu Xiang, Alexander Gasnikov,
Martin Tak\'a\v{c}, Dmitry Kamzolov
- Abstract summary: We present a novel adaptation of Stochastic Gradient Descent (SGD) called AdaBatchGrad.
It seamlessly integrates an adaptive step size with an adjustable batch size.
We experimentally show how introducing an adaptive step size and an adaptive batch size gradually improves the performance of regular SGD.
- Score: 42.84471753630676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel adaptation of the Stochastic Gradient Descent
(SGD), termed AdaBatchGrad. This modification seamlessly integrates an adaptive
step size with an adjustable batch size. An increase in batch size and a
decrease in step size are well-known techniques to tighten the area of
convergence of SGD and decrease its variance. A range of studies by R. Byrd and
J. Nocedal introduced various testing techniques to assess the quality of
mini-batch gradient approximations and choose the appropriate batch sizes at
every step. Methods that utilized exact tests were observed to converge within
$O(LR^2/\varepsilon)$ iterations. Conversely, inexact test implementations
sometimes resulted in non-convergence and erratic performance. To address these
challenges, AdaBatchGrad incorporates both adaptive batch and step sizes,
enhancing the method's robustness and stability. For exact tests, our approach
converges in $O(LR^2/\varepsilon)$ iterations, analogous to standard gradient
descent. For inexact tests, it achieves convergence in $O(\max\lbrace
LR^2/\varepsilon, \sigma^2 R^2/\varepsilon^2 \rbrace )$ iterations. This makes
AdaBatchGrad markedly more robust and computationally efficient relative to
prevailing methods. To substantiate the efficacy of our method, we
experimentally show how the introduction of an adaptive step size and an adaptive
batch size gradually improves the performance of regular SGD. The results imply
that AdaBatchGrad surpasses alternative methods, especially when applied to
inexact tests.
Related papers
- Modified Step Size for Enhanced Stochastic Gradient Descent: Convergence and Experiments [0.0]
This paper introduces a novel approach to improving the performance of the stochastic gradient descent (SGD) algorithm by incorporating a modified decay step size based on $\frac{1}{\sqrt{t}}$.
The proposed step size integrates a logarithmic term, leading to the selection of smaller values in the final iterations.
To demonstrate the effectiveness of our approach, we conducted numerical experiments on image classification tasks using the FashionMNIST and CIFAR10 datasets.
arXiv Detail & Related papers (2023-09-03T19:21:59Z)
- Cutting Some Slack for SGD with Adaptive Polyak Stepsizes [35.024680868164445]
We consider the family of SPS (Stochastic gradient with a Polyak Stepsize) adaptive methods.
We first show that SPS and its recent variants can all be seen as extensions of the Passive-Aggressive methods applied to nonlinear problems.
We use this insight to develop new variants of the SPS method that are better suited to nonlinear models.
arXiv Detail & Related papers (2022-02-24T19:31:03Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with a mini-batch size of $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search) [32.24244211281863]
We study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.
We prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate.
We show that these techniques improve the convergence and generalization of adaptive gradient methods across tasks.
arXiv Detail & Related papers (2020-06-11T21:23:30Z)
- Carath\'eodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carath\'eodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.