The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods
- URL: http://arxiv.org/abs/2102.07314v1
- Date: Mon, 15 Feb 2021 02:57:14 GMT
- Title: The Role of Momentum Parameters in the Optimal Convergence of Adaptive
Polyak's Heavy-ball Methods
- Authors: Wei Tao, Sheng Long, Gaowei Wu, Qing Tao
- Abstract summary: We prove that the adaptive Polyak's Heavy-ball (HB) method attains an optimal individual convergence rate of $O(\frac{1}{\sqrt{t}})$.
Our new analysis shows how the HB momentum and its time-varying weight help us to achieve the acceleration in convex optimization.
- Score: 12.93796690939018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The adaptive stochastic gradient descent (SGD) with momentum has been widely
adopted in deep learning as well as convex optimization. In practice, the last
iterate is commonly used as the final solution to make decisions. However, the
available regret analysis and the setting of constant momentum parameters only
guarantee the optimal convergence of the averaged solution. In this paper, we
fill this theory-practice gap by investigating the convergence of the last
iterate (referred to as individual convergence), which is a more difficult task
than convergence analysis of the averaged solution. Specifically, in the
constrained convex cases, we prove that the adaptive Polyak's Heavy-ball (HB)
method, in which only the step size is updated using the exponential moving
average strategy, attains an optimal individual convergence rate of
$O(\frac{1}{\sqrt{t}})$, as opposed to the optimality of $O(\frac{\log t}{\sqrt
{t}})$ of SGD, where $t$ is the number of iterations. Our new analysis not only
shows how the HB momentum and its time-varying weight help us to achieve the
acceleration in convex optimization but also gives valuable hints on how the
momentum parameters should be scheduled in deep learning. Empirical results on
optimizing convex functions and training deep networks validate the correctness
of our convergence analysis and demonstrate the improved performance of the
adaptive HB methods.
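For readers who want a concrete picture of the method described above, here is a minimal NumPy sketch of a projected Heavy-ball step whose step size is adapted coordinate-wise through an exponential moving average of squared gradients. The particular decay rate, base step size, and time-varying momentum weight beta_t are illustrative assumptions, not the exact schedules analyzed in the paper.
```python
import numpy as np

def adaptive_heavy_ball(grad, project, x0, T, alpha=0.1, gamma=0.9, eps=1e-8):
    """Projected Heavy-ball sketch with an RMSProp-style adaptive step size.

    grad(x)    -- returns a (sub)gradient of the objective at x (ndarray)
    project(x) -- Euclidean projection onto the constraint set
    The EMA decay gamma, base step alpha, and the t/(t+2) momentum weight
    are illustrative assumptions, not the schedules analyzed in the paper.
    """
    x_prev = x_curr = np.asarray(x0, dtype=float)
    v = np.zeros_like(x_curr)                      # EMA of squared gradients
    for t in range(1, T + 1):
        g = grad(x_curr)
        v = gamma * v + (1.0 - gamma) * g * g      # exponential moving average
        step = alpha / (np.sqrt(v) + eps)          # coordinate-wise adaptive step size
        beta_t = t / (t + 2.0)                     # time-varying momentum weight (assumed)
        x_next = project(x_curr - step * g + beta_t * (x_curr - x_prev))
        x_prev, x_curr = x_curr, x_next
    return x_curr                                  # last iterate = individual solution
```
Returning the last iterate, rather than an average of the iterates, mirrors the individual-convergence setting analyzed in the paper.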
Related papers
- Shuffling Momentum Gradient Algorithm for Convex Optimization [22.58278411628813]
The Stochastic Gradient Descent method (SGD) and its variants have become the methods of choice for solving finite-sum optimization problems from machine learning and data science.
We provide the first analysis of shuffling momentum-based methods for the strongly convex setting.
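As a rough, generic illustration of combining data shuffling with heavy-ball momentum (not the paper's exact shuffling momentum update), one random-reshuffling momentum loop can be sketched as follows:
```python
import numpy as np

def shuffled_momentum_sgd(sample_grads, x0, epochs, lr=0.01, beta=0.9, seed=0):
    """Generic random-reshuffling SGD with heavy-ball momentum (illustration only).

    sample_grads -- list of callables g_i(x), one per component of the finite sum.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(epochs):
        for i in rng.permutation(len(sample_grads)):  # one pass over a fresh shuffle
            m = beta * m + sample_grads[i](x)          # heavy-ball momentum buffer
            x = x - lr * m
    return x
```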
arXiv Detail & Related papers (2024-03-05T18:19:02Z) - Formal guarantees for heuristic optimization algorithms used in machine
learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale machine learning (ML) optimization problems.
We provide formal guarantees for a few convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms to solve nonconvex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
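For context, a plain Stochastic Gradient Langevin Dynamics step, without the variance-reduction refinements studied in this paper, can be sketched as below; the step size and inverse temperature are illustrative choices.
```python
import numpy as np

def sgld(stoch_grad, x0, T, eta=1e-3, beta=1.0, seed=0):
    """Minimal stochastic gradient Langevin dynamics loop (no variance reduction)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = stoch_grad(x)                                    # noisy gradient estimate
        xi = rng.normal(size=x.shape)                        # injected Gaussian noise
        x = x - eta * g + np.sqrt(2.0 * eta / beta) * xi     # Langevin update
    return x
```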
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study stochastic optimization of the area under the precision-recall curve (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with an improved iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with a constant step size, along with variations of the method that yield favorable convergence.
We prove that, when augmented with iterate averaging, SEG converges to the Nash equilibrium, and that this rate is further accelerated by incorporating a scheduled restarting procedure.
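A minimal sketch of the stochastic extragradient step with uniform iterate averaging on a bilinear game $\min_x \max_y x^\top A y$ is given below; the scheduled restarting procedure from the paper is omitted, and the additive Gaussian noise model is assumed purely for illustration.
```python
import numpy as np

def seg_bilinear(A, x0, y0, T, eta=0.05, sigma=0.01, seed=0):
    """Stochastic extragradient with uniform iterate averaging on min_x max_y x^T A y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
    for t in range(1, T + 1):
        # noisy evaluations of the game operator F(x, y) = (A y, -A^T x)
        gx = A @ y + sigma * rng.normal(size=x.shape)
        gy = -A.T @ x + sigma * rng.normal(size=y.shape)
        xe, ye = x - eta * gx, y - eta * gy                  # extrapolation (look-ahead) step
        gxe = A @ ye + sigma * rng.normal(size=x.shape)
        gye = -A.T @ xe + sigma * rng.normal(size=y.shape)
        x, y = x - eta * gxe, y - eta * gye                  # update from the original point
        x_avg += (x - x_avg) / t                             # running uniform average
        y_avg += (y - y_avg) / t
    return x_avg, y_avg
```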
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex
Optimization [15.731908248435348]
We develop gradient descent averaging and primal-dual averaging algorithms for strongly convex cases.
We prove that gradient descent averaging yields the optimal convergence rate in terms of output averaging, while primal-dual averaging (SC-PDA) achieves the optimal individual convergence.
Several experiments on SVMs and deep learning models validate the correctness of the theoretical analysis and the effectiveness of the algorithms.
arXiv Detail & Related papers (2020-12-29T01:40:30Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work focuses on zeroth-order (ZO) optimization, which does not require first-order gradient information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity and function query cost.
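As a generic illustration of the zeroth-order idea (a simple two-point random-direction gradient estimator, not the paper's hybrid scheme with coordinate importance sampling):
```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_dirs=10, seed=0):
    """Two-point random-direction zeroth-order gradient estimate of f at x."""
    rng = np.random.default_rng(seed)
    g = np.zeros_like(x, dtype=float)
    for _ in range(num_dirs):
        u = rng.normal(size=x.shape)                           # random Gaussian direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u  # central finite difference
    return g / num_dirs
```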
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove a $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with a mini-batch size of $1$, constant first- and second-order moment parameters, and possibly unbounded optimization domains.
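For reference, the Moreau envelope whose gradient norm serves as the stationarity measure is the standard construction below, with $\lambda > 0$ a smoothing parameter taken small enough relative to the weak-convexity modulus.
```latex
% Moreau envelope of f with smoothing parameter \lambda > 0; its gradient
% norm is the standard near-stationarity measure for weakly convex problems.
\[
  f_{\lambda}(x) \;=\; \min_{y}\Big\{ f(y) + \tfrac{1}{2\lambda}\|y - x\|^{2} \Big\},
  \qquad
  \nabla f_{\lambda}(x) \;=\; \tfrac{1}{\lambda}\bigl(x - \operatorname{prox}_{\lambda f}(x)\bigr).
\]
```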
arXiv Detail & Related papers (2020-06-11T17:43:19Z) - The Strength of Nesterov's Extrapolation in the Individual Convergence
of Nonsmooth Optimization [0.0]
We prove that Nesterov's extrapolation has the strength to make the individual convergence of gradient descent methods optimal for nonsmooth problems.
We give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings.
Our method is applicable as an efficient tool for solving large-scale $\ell_1$-regularized hinge-loss learning problems.
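A generic sketch of the extrapolation idea for constrained nonsmooth problems is given below (projected subgradient with a Nesterov-style extrapolation point); the step-size and extrapolation schedules are placeholders rather than the ones analyzed in the paper.
```python
import numpy as np

def extrapolated_subgradient(subgrad, project, x0, T, c=1.0):
    """Projected subgradient method with a Nesterov-style extrapolation point."""
    x_prev = x_curr = np.asarray(x0, dtype=float)
    for t in range(1, T + 1):
        theta_t = (t - 1.0) / (t + 2.0)               # extrapolation weight (placeholder)
        y = x_curr + theta_t * (x_curr - x_prev)      # extrapolated query point
        eta_t = c / np.sqrt(t)                        # diminishing step size (placeholder)
        x_prev, x_curr = x_curr, project(y - eta_t * subgrad(y))
    return x_curr                                     # last (individual) iterate
```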
arXiv Detail & Related papers (2020-06-08T03:35:41Z)