Relationship between Batch Size and Number of Steps Needed for Nonconvex
Optimization of Stochastic Gradient Descent using Armijo Line Search
- URL: http://arxiv.org/abs/2307.13831v4
- Date: Thu, 1 Feb 2024 14:14:56 GMT
- Title: Relationship between Batch Size and Number of Steps Needed for Nonconvex
Optimization of Stochastic Gradient Descent using Armijo Line Search
- Authors: Yuki Tsukada, Hideaki Iiduka
- Abstract summary: We show that SGD performs better than other deep learning optimizers when it uses learning rates given by line search methods.
The results indicate that the number of steps needed for nonconvex optimization decreases as the batch size grows and that the critical batch size minimizing the SFO complexity can be estimated.
- Score: 0.8158530638728501
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While stochastic gradient descent (SGD) can use various learning rates, such
as constant or diminishing rates, previous numerical results showed that
SGD performs better than other deep learning optimizers when it uses
learning rates given by line search methods. In this paper, we perform a
convergence analysis of SGD with a learning rate given by an Armijo line search
for nonconvex optimization, indicating that the upper bound of the expectation
of the squared norm of the full gradient becomes small when the number of steps
and the batch size are large. Next, we show that, for SGD with the
Armijo-line-search learning rate, the number of steps needed for nonconvex
optimization is a monotone decreasing convex function of the batch size; that
is, the number of steps needed for nonconvex optimization decreases as the
batch size increases. Furthermore, we show that the stochastic first-order
oracle (SFO) complexity, which is the stochastic gradient computation cost, is
a convex function of the batch size; that is, there exists a critical batch
size that minimizes the SFO complexity. Finally, we provide numerical results
that support our theoretical results. The numerical results indicate that the
number of steps needed for training deep neural networks decreases as the batch
size increases and that there exist critical batch sizes that can be
estimated from the theoretical results.
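For concreteness, the sketch below is a toy, self-contained illustration (not the paper's code or experiments) of the two quantities analyzed above: SGD whose per-step learning rate is chosen by an Armijo backtracking line search on the current mini-batch loss, and the SFO complexity b * K(b), where K(b) is the number of steps needed at batch size b. The least-squares problem, constants, and function names are illustrative assumptions.

```python
# Toy sketch (illustrative assumptions only): Armijo-line-search SGD on a
# noiseless least-squares problem, plus the SFO cost b * K(b) for several b.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: f(w) = (1/n) * ||X w - y||^2, with noiseless targets so that
# every mini-batch shares the minimizer and the run converges for all b.
n, d = 2000, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def batch_loss_grad(w, idx):
    # Mini-batch loss and gradient on the rows indexed by idx.
    Xb, yb = X[idx], y[idx]
    r = Xb @ w - yb
    return np.mean(r ** 2), 2.0 * Xb.T @ r / len(idx)

def steps_to_converge(batch_size, tol=1e-2, max_steps=20000,
                      lr_init=1.0, c=1e-4, beta=0.5):
    # Armijo-line-search SGD; returns K(b), the number of steps until the
    # full gradient norm drops below tol (or max_steps if it never does).
    w = np.zeros(d)
    for k in range(1, max_steps + 1):
        idx = rng.choice(n, size=batch_size, replace=False)
        loss, g = batch_loss_grad(w, idx)
        lr = lr_init
        # Backtrack until the Armijo sufficient-decrease condition holds on
        # the same mini-batch: f_B(w - lr*g) <= f_B(w) - c * lr * ||g||^2
        for _ in range(50):
            if batch_loss_grad(w - lr * g, idx)[0] <= loss - c * lr * (g @ g):
                break
            lr *= beta
        w = w - lr * g
        full_grad = 2.0 * X.T @ (X @ w - y) / n
        if np.linalg.norm(full_grad) <= tol:
            return k
    return max_steps

# SFO complexity: each step costs b stochastic gradient evaluations, so the
# total cost at batch size b is b * K(b); the critical batch size minimizes it.
for b in [8, 16, 32, 64, 128, 256]:
    K = steps_to_converge(b)
    print(f"batch size {b:4d}: steps K(b) = {K:6d}, SFO cost b*K(b) = {b * K}")
```

On this toy problem, K(b) typically decreases as the batch size grows while the product b * K(b) is minimized at an intermediate batch size, mirroring the monotonicity and critical-batch-size results stated in the abstract.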
Related papers
- Iteration and Stochastic First-order Oracle Complexities of Stochastic
Gradient Descent using Constant and Decaying Learning Rates [0.8158530638728501]
We show that the performance of stochastic gradient descent (SGD) depends not only on the learning rate but also on the batch size.
We show that measured critical batch sizes are close to the sizes estimated from our theoretical results.
arXiv Detail & Related papers (2024-02-23T14:24:45Z) - Ordering for Non-Replacement SGD [7.11967773739707]
We seek to find an ordering that can improve the convergence rates for the non-replacement form of the algorithm.
We develop optimal orderings for constant and decreasing step sizes for strongly convex and convex functions.
In addition, we are able to combine the ordering with mini-batching and further apply it to more complex neural networks.
arXiv Detail & Related papers (2023-06-28T00:46:58Z) - Convergence of the mini-batch SIHT algorithm [0.0]
The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimization problems.
We show that the sequence generated by the sparse mini-batch SIHT is a supermartingale sequence and converges with probability one.
arXiv Detail & Related papers (2022-09-29T03:47:46Z) - Critical Batch Size Minimizes Stochastic First-Order Oracle Complexity of
Deep Learning Optimizer using Hyperparameters Close to One [0.0]
We show that deep learning optimizers using small constant learning rates, hyperparameters close to one, and large batch sizes can find the model parameters of deep neural networks that minimize the loss functions.
Results indicate that Adam using a small constant learning rate, hyperparameters close to one, and the critical batch size minimizing SFO complexity has faster convergence than Momentum and gradient descent.
arXiv Detail & Related papers (2022-08-21T06:11:23Z) - Faster One-Sample Stochastic Conditional Gradient Method for Composite
Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, but only up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z) - Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: The random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.