Formal guarantees for heuristic optimization algorithms used in machine
learning
- URL: http://arxiv.org/abs/2208.00502v1
- Date: Sun, 31 Jul 2022 19:41:22 GMT
- Title: Formal guarantees for heuristic optimization algorithms used in machine
learning
- Authors: Xiaoyu Li
- Abstract summary: Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization of machine learning (ML) problems. We provide formal guarantees for a few heuristic optimization methods and propose improved algorithms.
- Score: 6.978625807687497
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recently, Stochastic Gradient Descent (SGD) and its variants have become the
dominant methods in the large-scale optimization of machine learning (ML)
problems. A variety of strategies have been proposed for tuning the step sizes,
ranging from adaptive step sizes to heuristic methods to change the step size
in each iteration. Also, momentum has been widely employed in ML tasks to
accelerate the training process. Yet, there is a gap in our theoretical
understanding of these methods. In this work, we start to close this gap by
providing formal guarantees for a few heuristic optimization methods and
proposing improved algorithms.
First, we analyze a generalized version of the AdaGrad (Delayed AdaGrad) step
sizes in both convex and non-convex settings, showing that these step sizes
allow the algorithms to automatically adapt to the level of noise of the
stochastic gradients. We show for the first time sufficient conditions for
Delayed AdaGrad to achieve almost sure convergence of the gradients to zero.
Moreover, we present a high probability analysis for Delayed AdaGrad and its
momentum variant in the non-convex setting.
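To make the "delayed" idea concrete, here is a minimal Python sketch (illustrative names and hyperparameters, not the paper's exact algorithm) of SGD with a Delayed AdaGrad-style global step size: the step size at iteration t is built only from gradients seen before iteration t, so it is independent of the current stochastic gradient.

    import numpy as np

    def delayed_adagrad_sgd(grad_fn, x0, alpha=1.0, b0=1.0, n_steps=1000):
        # SGD with a Delayed AdaGrad-style global step size. grad_fn(x)
        # returns a stochastic gradient at x; alpha and b0 are
        # illustrative hyperparameters, not values from the paper.
        x = np.asarray(x0, dtype=float)
        accum = b0 ** 2  # b_0^2 plus squared norms of *past* gradients
        for _ in range(n_steps):
            eta = alpha / np.sqrt(accum)  # uses only gradients before step t
            g = grad_fn(x)                # draw the current stochastic gradient
            x = x - eta * g
            accum += float(np.dot(g, g))  # fold g in only after the update
        return x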
Second, we analyze SGD with exponential and cosine step sizes, which are
empirically successful but lack theoretical support. We provide the very first
convergence guarantees for them in the smooth and non-convex setting, with and
without the Polyak-{\L}ojasiewicz (PL) condition. We also show their good
property of adaptivity to noise under the PL condition.
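For concreteness, both schedules have simple closed forms, and the PL condition referenced above requires $\|\nabla f(x)\|^2 \ge 2\mu\,(f(x) - f^*)$ for some $\mu > 0$. The Python sketch below shows one common parameterization of each schedule (the constants are illustrative, not the paper's tuned choices); either function can replace a constant step size in a plain SGD loop.

    import math

    def exponential_step_size(t, eta0=0.1, alpha=0.999):
        # Exponentially decaying step size: eta_t = eta0 * alpha**t,
        # with decay rate alpha in (0, 1).
        return eta0 * alpha ** t

    def cosine_step_size(t, horizon, eta0=0.1):
        # Cosine-annealed step size over `horizon` iterations:
        # eta_t = (eta0 / 2) * (1 + cos(pi * t / horizon)).
        return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / horizon))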
Third, we study the last iterate of momentum methods. We prove the first
lower bound in the convex setting for the last iterate of SGD with constant
momentum. Moreover, we investigate a class of
Follow-The-Regularized-Leader-based momentum algorithms with increasing
momentum and shrinking updates. We show that their last iterate has optimal
convergence for unconstrained convex stochastic optimization problems.
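As a point of reference for this third part, the Python sketch below shows plain SGD with constant heavy-ball momentum (illustrative hyperparameters; not the FTRL-based scheme from the paper): the quantity under study is the loop's final point, with no averaging over the trajectory.

    import numpy as np

    def sgd_momentum(grad_fn, x0, eta=0.01, beta=0.9, n_steps=1000):
        # SGD with constant heavy-ball momentum; eta and beta are
        # illustrative. The returned value is the last iterate x_T.
        x = np.asarray(x0, dtype=float)
        m = np.zeros_like(x)
        for _ in range(n_steps):
            g = grad_fn(x)
            m = beta * m + g  # momentum buffer accumulates past gradients
            x = x - eta * m
        return x              # last iterate only, no averaging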
Related papers
- Directional Smoothness and Gradient Methods: Convergence and Adaptivity [16.779513676120096]
We develop new sub-optimality bounds for gradient descent that depend on the conditioning of the objective along the path of optimization.
Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper bounds on the objective.
We prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness (see the Polyak step-size sketch after this list).
arXiv Detail & Related papers (2024-03-06T22:24:05Z)
- Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely Variance Reduced Langevin Dynamics and Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
- A Two-Time-Scale Stochastic Optimization Framework with Applications in Control and Reinforcement Learning [13.908826484332282]
We study a new two-time-scale stochastic gradient method for solving optimization problems.
Our first contribution is to characterize the finite-time complexity of the proposed two-time-scale gradient algorithm.
We apply our framework to gradient-based policy evaluation algorithms in reinforcement learning.
arXiv Detail & Related papers (2021-09-29T23:15:23Z)
- Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models [0.2741266294612776]
We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem.
We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly.
In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
arXiv Detail & Related papers (2021-06-07T15:08:24Z)
- Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [75.05106948314956]
We show that an increasingly large momentum parameter for the first-order moment is sufficient for adaptive scaling.
We also give insights into increasing the momentum in a stagewise manner, in accordance with a stagewise decreasing step size.
arXiv Detail & Related papers (2021-04-30T08:50:24Z)
- A Momentum-Assisted Single-Timescale Stochastic Approximation Algorithm for Bilevel Optimization [112.59170319105971]
We propose a new algorithm -- the Momentum-assisted Single-Timescale Stochastic Approximation (MSTSA) -- for tackling bilevel optimization problems.
MSTSA allows us to control the error in its iterations caused by inaccurate solutions to the lower-level subproblem.
arXiv Detail & Related papers (2021-02-15T07:10:33Z)
- A Unified Analysis of First-Order Methods for Smooth Games via Integral Quadratic Constraints [10.578409461429626]
In this work, we adapt the integral quadratic constraints theory to first-order methods for smooth and strongly-monotone games.
We provide, for the first time, a global convergence rate for the negative momentum method (NM) with an iteration complexity of $\mathcal{O}(\kappa^{1.5})$, which matches its known lower bound.
We show that it is impossible for an algorithm with one step of memory to achieve acceleration if it only queries the gradient once per batch.
arXiv Detail & Related papers (2020-09-23T20:02:00Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point per step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step size is required for exact convergence against the fact that a constant step size learns faster in finite time, up to an error.
Rather than fixing the minibatch size and the step size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over their non-adaptive counterparts can be observed empirically in GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
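As referenced in the first entry of the list above, the Polyak step-size has a simple closed form. The Python sketch below is a minimal illustration of the classical rule (it assumes the optimal value f_star is known, and it is not the cited paper's method):

    import numpy as np

    def polyak_gd(f, grad_f, x0, f_star, n_steps=100):
        # Gradient descent with the classical Polyak step-size
        # eta_t = (f(x_t) - f_star) / ||grad f(x_t)||^2, which needs the
        # optimal value f_star but no smoothness constants.
        x = np.asarray(x0, dtype=float)
        for _ in range(n_steps):
            g = grad_f(x)
            gnorm2 = float(np.dot(g, g))
            if gnorm2 == 0.0:  # stationary point reached
                break
            eta = (f(x) - f_star) / gnorm2
            x = x - eta * g
        return x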