The Strength of Nesterov's Extrapolation in the Individual Convergence
of Nonsmooth Optimization
- URL: http://arxiv.org/abs/2006.04340v1
- Date: Mon, 8 Jun 2020 03:35:41 GMT
- Title: The Strength of Nesterov's Extrapolation in the Individual Convergence
of Nonsmooth Optimization
- Authors: W. Tao, Z. Pan, G. Wu, and Q. Tao
- Abstract summary: We prove that Nesterov's extrapolation has the strength to make the individual convergence of gradient descent methods optimal for nonsmooth problems.
We give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in settings.
Our method is applicable as an efficient tool for solving large-scale $l$1-regularized hinge-loss learning problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The extrapolation strategy raised by Nesterov, which can accelerate the
convergence rate of gradient descent methods by orders of magnitude when
dealing with smooth convex objective, has led to tremendous success in training
machine learning tasks. In this article, the convergence of individual iterates
of projected subgradient (PSG) methods for nonsmooth convex optimization
problems is theoretically studied based on Nesterov's extrapolation, which we
name individual convergence. We prove that Nesterov's extrapolation has the
strength to make the individual convergence of PSG optimal for nonsmooth
problems. In light of this consideration, a direct modification of the
subgradient evaluation suffices to achieve optimal individual convergence for
strongly convex problems, which can be regarded as making an interesting step
toward the open question about stochastic gradient descent (SGD) posed by
Shamir. Furthermore, we give an extension of the derived algorithms to solve
regularized learning tasks with nonsmooth losses in stochastic settings.
Compared with other state-of-the-art nonsmooth methods, the derived algorithms
can serve as an alternative to the basic SGD especially in coping with machine
learning problems, where an individual output is needed to guarantee the
regularization structure while keeping an optimal rate of convergence.
Typically, our method is applicable as an efficient tool for solving
large-scale $l$1-regularized hinge-loss learning problems. Several comparison
experiments demonstrate that our individual output not only achieves an optimal
convergence rate but also guarantees better sparsity than the averaged
solution.
Related papers
- Dealing with unbounded gradients in stochastic saddle-point optimization [9.983014605039658]
We study the performance of first-order methods for finding saddle points of convex-concave functions.
A notorious challenge is that the gradients can grow arbitrarily large during optimization.
We propose a simple and effective regularization technique that stabilizes the iterates and yields meaningful performance guarantees.
arXiv Detail & Related papers (2024-02-21T16:13:49Z) - Low-Rank Extragradient Methods for Scalable Semidefinite Optimization [0.0]
We focus on high-dimensional and plausible settings in which the problem admits a low-rank solution.
We provide several theoretical results proving that, under these circumstances, the well-known Extragradient method converges to a solution of the constrained optimization problem.
arXiv Detail & Related papers (2024-02-14T10:48:00Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Linearization Algorithms for Fully Composite Optimization [61.20539085730636]
This paper studies first-order algorithms for solving fully composite optimization problems convex compact sets.
We leverage the structure of the objective by handling differentiable and non-differentiable separately, linearizing only the smooth parts.
arXiv Detail & Related papers (2023-02-24T18:41:48Z) - Versatile Single-Loop Method for Gradient Estimator: First and Second
Order Optimality, and its Application to Federated Learning [45.78238792836363]
We present a single-loop algorithm named SLEDGE (Single-Loop-E Gradient Estimator) for periodic convergence.
Unlike existing methods, SLEDGE has the advantage of versatility; (ii) second-order optimal, (ii) in the PL region, and (iii) smaller complexity under less of data.
arXiv Detail & Related papers (2022-09-01T11:05:26Z) - Faster Algorithm and Sharper Analysis for Constrained Markov Decision
Process [56.55075925645864]
The problem of constrained decision process (CMDP) is investigated, where an agent aims to maximize the expected accumulated discounted reward subject to multiple constraints.
A new utilities-dual convex approach is proposed with novel integration of three ingredients: regularized policy, dual regularizer, and Nesterov's gradient descent dual.
This is the first demonstration that nonconcave CMDP problems can attain the lower bound of $mathcal O (1/epsilon)$ for all complexity optimization subject to convex constraints.
arXiv Detail & Related papers (2021-10-20T02:57:21Z) - High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z) - Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex
Optimization [15.731908248435348]
We develop gradient descent averaging and primal-dual averaging algorithms for strongly convex cases.
We prove that primal-dual averaging yields the optimal convergence rate in terms of output averaging, while SC-PDA derives the optimal individual convergence.
Several experiments on SVMs and deep learning models validate the correctness of theoretical analysis and effectiveness of algorithms.
arXiv Detail & Related papers (2020-12-29T01:40:30Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Convergence of adaptive algorithms for weakly convex constrained
optimization [59.36386973876765]
We prove the $mathcaltilde O(t-1/4)$ rate of convergence for the norm of the gradient of Moreau envelope.
Our analysis works with mini-batch size of $1$, constant first and second order moment parameters, and possibly smooth optimization domains.
arXiv Detail & Related papers (2020-06-11T17:43:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.