Adaptive Learning of the Optimal Batch Size of SGD
- URL: http://arxiv.org/abs/2005.01097v2
- Date: Fri, 19 Nov 2021 18:46:59 GMT
- Title: Adaptive Learning of the Optimal Batch Size of SGD
- Authors: Motasem Alfarra, Slavomir Hanzely, Alyazeed Albasyoni, Bernard Ghanem
and Peter Richtarik
- Abstract summary: We propose a method capable of learning the optimal batch size adaptively throughout its iterations for strongly convex and smooth functions.
Our method does this provably, and in our experiments with synthetic and real data robustly exhibits nearly optimal behaviour.
We generalize our method to several new batch strategies not considered in the literature before, including a sampling suitable for distributed implementations.
- Score: 52.50880550357175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in the theoretical understanding of SGD led to a formula for
the optimal batch size minimizing the number of effective data passes, i.e.,
the number of iterations times the batch size. However, this formula is of no
practical value as it depends on the knowledge of the variance of the
stochastic gradients evaluated at the optimum. In this paper we design a
practical SGD method capable of learning the optimal batch size adaptively
throughout its iterations for strongly convex and smooth functions. Our method
does this provably, and in our experiments with synthetic and real data
robustly exhibits nearly optimal behaviour; that is, it works as if the optimal
batch size was known a-priori. Further, we generalize our method to several new
batch strategies not considered in the literature before, including a sampling
suitable for distributed implementations.
Related papers
- Derivative-Free Optimization via Finite Difference Approximation: An Experimental Study [1.3886390523644807]
Derivative-free optimization (DFO) is vital in solving complex optimization problems where only noisy function evaluations are available through an oracle.
Two classical iteration approaches are Kiefer-Wolfowitz (KW) and simultaneous perturbation approximation (SPSA) algorithms.
This paper conducts a comprehensive experimental comparison among these approaches.
arXiv Detail & Related papers (2024-10-31T18:07:44Z) - Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling [27.058009599819012]
We study the connection between optimal learning rates and batch sizes for Adam styles.
We prove that the optimal learning rate first rises and then falls as the batch size increases.
arXiv Detail & Related papers (2024-05-23T13:52:36Z) - Differentially Private Optimization with Sparse Gradients [60.853074897282625]
We study differentially private (DP) optimization problems under sparsity of individual gradients.
Building on this, we obtain pure- and approximate-DP algorithms with almost optimal rates for convex optimization with sparse gradients.
arXiv Detail & Related papers (2024-04-16T20:01:10Z) - An Adaptive Dimension Reduction Estimation Method for High-dimensional
Bayesian Optimization [6.79843988450982]
We propose a two-step optimization framework to extend BO to high-dimensional settings.
Our algorithm offers the flexibility to operate these steps either concurrently or in sequence.
Numerical experiments validate the efficacy of our method in challenging scenarios.
arXiv Detail & Related papers (2024-03-08T16:21:08Z) - Beyond Single-Model Views for Deep Learning: Optimization versus
Generalizability of Stochastic Optimization Algorithms [13.134564730161983]
This paper adopts a novel approach to deep learning optimization, focusing on gradient descent (SGD) and its variants.
We show that SGD and its variants demonstrate performance on par with flat-minimas like SAM, albeit with half the gradient evaluations.
Our study uncovers several key findings regarding the relationship between training loss and hold-out accuracy, as well as the comparable performance of SGD and noise-enabled variants.
arXiv Detail & Related papers (2024-03-01T14:55:22Z) - Variational Linearized Laplace Approximation for Bayesian Deep Learning [11.22428369342346]
We propose a new method for approximating Linearized Laplace Approximation (LLA) using a variational sparse Gaussian Process (GP)
Our method is based on the dual RKHS formulation of GPs and retains, as the predictive mean, the output of the original DNN.
It allows for efficient optimization, which results in sub-linear training time in the size of the training dataset.
arXiv Detail & Related papers (2023-02-24T10:32:30Z) - Sparse high-dimensional linear regression with a partitioned empirical
Bayes ECM algorithm [62.997667081978825]
We propose a computationally efficient and powerful Bayesian approach for sparse high-dimensional linear regression.
Minimal prior assumptions on the parameters are used through the use of plug-in empirical Bayes estimates.
The proposed approach is implemented in the R package probe.
arXiv Detail & Related papers (2022-09-16T19:15:50Z) - Large-scale Optimization of Partial AUC in a Range of False Positive
Rates [51.12047280149546]
The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning.
We develop an efficient approximated gradient descent method based on recent practical envelope smoothing technique.
Our proposed algorithm can also be used to minimize the sum of some ranked range loss, which also lacks efficient solvers.
arXiv Detail & Related papers (2022-03-03T03:46:18Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.