Proximal Gradient Descent-Ascent: Variable Convergence under K{\L} Geometry
- URL: http://arxiv.org/abs/2102.04653v1
- Date: Tue, 9 Feb 2021 05:35:53 GMT
- Title: Proximal Gradient Descent-Ascent: Variable Convergence under K{\L} Geometry
- Authors: Ziyi Chen, Yi Zhou, Tengyu Xu, Yingbin Liang
- Abstract summary: The gradient descent-ascent (GDA) algorithm has been widely applied to solve minimax optimization problems.
This paper fills such a gap by studying the variable convergence of a proximal-GDA algorithm under K{\L} geometry.
- Score: 49.65455534654459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The gradient descent-ascent (GDA) algorithm has been widely applied to solve
minimax optimization problems. In order to achieve convergent policy parameters
for minimax optimization, it is important that GDA generates convergent
variable sequences rather than convergent sequences of function values or
gradient norms. However, the variable convergence of GDA has been proved only
under convexity geometries, and there is a lack of understanding for general nonconvex
minimax optimization. This paper fills such a gap by studying the convergence
of a more general proximal-GDA for regularized nonconvex-strongly-concave
minimax optimization. Specifically, we show that proximal-GDA admits a novel
Lyapunov function, which monotonically decreases in the minimax optimization
process and drives the variable sequence to a critical point. By leveraging
this Lyapunov function and the K{\L} geometry that parameterizes the local
geometries of general nonconvex functions, we formally establish the variable
convergence of proximal-GDA to a critical point $x^*$, i.e., $x_t\to x^*,
y_t\to y^*(x^*)$. Furthermore, over the full spectrum of the
K{\L}-parameterized geometry, we show that proximal-GDA achieves different
types of convergence rates ranging from sublinear convergence up to finite-step
convergence, depending on the geometry associated with the K{\L} parameter.
This is the first theoretical result on the variable convergence for nonconvex
minimax optimization.
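As a concrete reading of the abstract, here is a minimal Python sketch of a proximal-GDA update for a regularized nonconvex-strongly-concave problem $\min_x \max_y f(x,y) + g(x)$. The toy objective, the $\ell_1$ regularizer, the alternating update order, and the step sizes are illustrative assumptions of this sketch, not the paper's exact algorithmic setting.

```python
import numpy as np

def prox_l1(v, lam):
    # Soft-thresholding: proximal operator of lam * ||.||_1 (illustrative choice of g).
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def proximal_gda(grad_x, grad_y, prox_g, x0, y0, eta_x=0.01, eta_y=0.1, iters=3000):
    # One plausible form of proximal-GDA:
    #   x_{t+1} = prox_{eta_x * g}(x_t - eta_x * grad_x f(x_t, y_t))
    #   y_{t+1} = y_t + eta_y * grad_y f(x_{t+1}, y_t)
    # The alternating order and the step sizes are assumptions of this sketch.
    x, y = x0.copy(), y0.copy()
    for _ in range(iters):
        x = prox_g(x - eta_x * grad_x(x, y), eta_x)
        y = y + eta_y * grad_y(x, y)
    return x, y

# Toy regularized nonconvex-strongly-concave objective (illustrative only):
#   f(x, y) = y^T A x + sum(cos(x_i)) - 0.5 * mu * ||y||^2,   g(x) = lam * ||x||_1.
# f(., y) is nonconvex in x (cosine term) and f(x, .) is mu-strongly concave in y.
rng = np.random.default_rng(0)
d, mu, lam = 5, 1.0, 0.05
A = rng.standard_normal((d, d))

grad_x = lambda x, y: A.T @ y - np.sin(x)       # d/dx f(x, y)
grad_y = lambda x, y: A @ x - mu * y            # d/dy f(x, y)
prox_g = lambda v, eta: prox_l1(v, lam * eta)   # prox of eta * g

x_star, y_star = proximal_gda(grad_x, grad_y, prox_g,
                              rng.standard_normal(d), rng.standard_normal(d))
```

The abstract's claim is that the whole variable sequence $(x_t, y_t)$ generated by updates of this type converges to a critical point, with a rate (sublinear up to finite-step) determined by the K{\L} parameter of the local geometry.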
Related papers
- On Convergence of Incremental Gradient for Non-Convex Smooth Functions [63.51187646914962]
In machine learning and network optimization, algorithms like shuffle SGD are popular because they minimize the number of cache misses and make good use of the cache.
This paper delves into the convergence properties of SGD algorithms with arbitrary data ordering (a minimal shuffled-SGD sketch follows this entry).
arXiv Detail & Related papers (2023-05-30T17:47:27Z)
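As a point of reference for the shuffle-SGD entry above, here is a minimal sketch of SGD with per-epoch reshuffling (without-replacement sampling) on a toy least-squares problem; the loss, learning rate, and epoch count are illustrative assumptions, not the setting analyzed in that paper.

```python
import numpy as np

def shuffled_sgd(X, y, lr=0.01, epochs=20, seed=0):
    # SGD that reshuffles the data once per epoch and then sweeps it
    # without replacement, in contrast to with-replacement sampling.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))          # arbitrary/random data ordering
        for i in order:
            grad = (X[i] @ w - y[i]) * X[i]      # per-sample least-squares gradient
            w -= lr * grad
    return w

# Toy usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = X @ rng.standard_normal(5)
w_hat = shuffled_sgd(X, y)
```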
- On the Convergence of AdaGrad(Norm) on $\R^{d}$: Beyond Convexity, Non-Asymptotic Rate and Acceleration [33.247600151322466]
We develop a deeper understanding of AdaGrad and its variants in the standard setting of smooth convex functions.
First, we demonstrate new techniques to explicitly bound the convergence rate of the vanilla AdaGrad for unconstrained problems.
Second, we propose a variant of AdaGrad for which we can show the convergence of the last iterate, instead of the average iterate (an AdaGrad-Norm sketch follows this entry).
arXiv Detail & Related papers (2022-09-29T14:44:40Z)
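For the AdaGrad(Norm) entry above, the sketch below shows the scalar-stepsize AdaGrad-Norm update on a generic smooth objective; the base step size, the initial accumulator, and the quadratic test function are assumptions of this sketch.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-8, iters=1000):
    # AdaGrad-Norm: a single adaptive scalar step size driven by the
    # cumulative squared gradient norms (no per-coordinate scaling).
    x = np.asarray(x0, dtype=float).copy()
    b2 = b0 ** 2
    for _ in range(iters):
        g = grad(x)
        b2 += float(g @ g)             # accumulate ||g_t||^2
        x -= (eta / np.sqrt(b2)) * g   # x_{t+1} = x_t - (eta / b_t) * g_t
    return x

# Example: smooth convex quadratic f(x) = 0.5 * ||x||^2, so grad(x) = x.
x_last = adagrad_norm(lambda x: x, x0=np.ones(3))
```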
- Randomized Coordinate Subgradient Method for Nonsmooth Composite Optimization [11.017632675093628]
Coordinate-type subgradient methods for addressing nonsmooth problems are relatively underexplored due to the restrictive set of properties required by Lipschitz-type assumptions.
arXiv Detail & Related papers (2022-06-30T02:17:11Z)
- Escaping Saddle Points in Nonconvex Minimax Optimization via Cubic-Regularized Gradient Descent-Ascent [13.565010494569243]
The gradient descent-ascent (GDA) algorithm has been widely applied to solve nonconvex minimax optimization problems.
We develop the first GDA-type algorithm for escaping saddle points in nonconvex-strongly-concave minimax optimization.
arXiv Detail & Related papers (2021-10-14T00:36:44Z)
- Last iterate convergence of SGD for Least-Squares in the Interpolation regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor perfectly fits inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map (a toy noiseless SGD sketch follows this entry).
arXiv Detail & Related papers (2021-02-05T14:02:20Z)
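For the interpolation-regime entry above, the sketch below runs single-sample SGD on a noiseless least-squares problem where a perfect predictor exists and keeps the last iterate rather than an average; the finite-dimensional feature map, step size, and iteration budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
Phi = rng.standard_normal((n, d))            # phi(X): here a finite-dimensional feature map
theta_star = rng.standard_normal(d)
Y = Phi @ theta_star                         # noiseless: <theta_*, phi(X)> = Y exactly

theta = np.zeros(d)
lr = 0.5 / np.max(np.sum(Phi ** 2, axis=1))  # conservative step size for per-sample updates
for t in range(50 * n):
    i = rng.integers(n)
    theta -= lr * (Phi[i] @ theta - Y[i]) * Phi[i]

# Last iterate (no averaging); in the interpolation regime the residual shrinks toward zero.
print(np.linalg.norm(Phi @ theta - Y))
```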
- Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence [120.9336529957224]
In this paper, we consider the nonconvex-strongly-concave setting of gradient-free minimax optimization.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity (a basic two-point zeroth-order gradient estimator is sketched after this entry).
arXiv Detail & Related papers (2020-06-16T17:55:46Z)
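The gradient-free entry above builds on zeroth-order gradient estimates; the sketch below shows a basic two-point estimator of $\nabla_x f(x, y)$ using random Gaussian directions. The smoothing radius and number of directions are assumptions of this sketch, and the paper's variance-reduction scheme is not reproduced.

```python
import numpy as np

def zo_grad_x(f, x, y, mu=1e-4, num_dirs=20, rng=None):
    # Two-point zeroth-order estimate of the gradient of f(., y) at x:
    # average of (f(x + mu*u, y) - f(x, y)) / mu * u over random directions u.
    rng = rng or np.random.default_rng()
    g = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u, y) - f(x, y)) / mu * u
    return g / num_dirs

# Example: f(x, y) = <x, y> - 0.5*||y||^2, so the true gradient in x is y.
f = lambda x, y: x @ y - 0.5 * y @ y
x, y = np.ones(4), np.arange(4.0)
print(zo_grad_x(f, x, y, num_dirs=500))   # noisy estimate of the true gradient y
```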
- Convergence of adaptive algorithms for weakly convex constrained optimization [59.36386973876765]
We prove the $\tilde{\mathcal{O}}(t^{-1/4})$ rate of convergence for the norm of the gradient of the Moreau envelope.
Our analysis works with a mini-batch size of $1$, constant first and second order moment parameters, and possibly unbounded optimization domains (the Moreau envelope is written out after this entry).
arXiv Detail & Related papers (2020-06-11T17:43:19Z)
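For the weakly convex entry above, the convergence measure is the gradient norm of the Moreau envelope; a standard definition is written out below, with the smoothing parameter $\lambda$ an assumption of this sketch.

```latex
% Moreau envelope of a \rho-weakly convex function \varphi, with 0 < \lambda < 1/\rho:
\varphi_\lambda(x) \;=\; \min_{z}\Big\{ \varphi(z) + \tfrac{1}{2\lambda}\,\|z - x\|^2 \Big\},
\qquad
\nabla \varphi_\lambda(x) \;=\; \tfrac{1}{\lambda}\big(x - \operatorname{prox}_{\lambda \varphi}(x)\big).
```

A small $\|\nabla \varphi_\lambda(x)\|$ certifies that $x$ is close to a nearly stationary point, which is why rates such as the $\tilde{\mathcal{O}}(t^{-1/4})$ above are stated for this quantity.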
- The Convergence Indicator: Improved and completely characterized parameter bounds for actual convergence of Particle Swarm Optimization [68.8204255655161]
We introduce a new convergence indicator that can be used to calculate whether the particles will finally converge to a single point or diverge.
Using this convergence indicator we provide the actual bounds completely characterizing parameter regions that lead to a converging swarm (a standard PSO update is sketched after this entry).
arXiv Detail & Related papers (2020-06-06T19:08:05Z)
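The PSO entry above characterizes the parameter regions (inertia $w$ and acceleration coefficients $c_1$, $c_2$) for which the swarm converges; the sketch below shows the standard PSO update that those parameters enter, on an assumed toy objective. The paper's convergence indicator itself is not reproduced here.

```python
import numpy as np

def pso(obj, dim=2, n_particles=20, w=0.7, c1=1.5, c2=1.5, iters=200, seed=0):
    # Standard PSO velocity/position update; whether the swarm contracts to a
    # single point depends on (w, c1, c2), which is what a convergence
    # indicator over the parameter space characterizes.
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_val = np.array([obj(p) for p in x])
    gbest = pbest[np.argmin(pbest_val)].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([obj(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)].copy()
    return gbest

best = pso(lambda p: float(np.sum(p ** 2)))   # toy sphere objective
```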
- S-ADDOPT: Decentralized stochastic first-order optimization over directed graphs [16.96562173221624]
Decentralized convex optimization is proposed to minimize a sum of smooth and strongly convex cost functions when the functions are distributed over the nodes of a directed network.
In particular, we propose the S-ADDOPT algorithm that assumes a first-order oracle at each node.
For decaying step-sizes $\mathcal{O}(1/k)$, we show that S-ADDOPT reaches the exact solution sublinearly at $\mathcal{O}(1/k)$ and that its convergence is network-independent.
arXiv Detail & Related papers (2020-05-15T21:14:22Z)
- A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a proof of convergence for both the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with its default parameters (a default-parameter Adam sketch follows this entry).
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
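To accompany the Adam/Adagrad entry above, here is a minimal Adam update with the commonly used default hyperparameters; the toy objective and iteration count are assumptions of this sketch, which shows only the update the convergence proof concerns.

```python
import numpy as np

def adam(grad, x0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, iters=5000):
    # Adam with its default hyperparameters: exponential moving averages of the
    # gradient (m) and the squared gradient (v), with bias correction.
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Example: smooth nonconvex toy objective f(x) = sum(x_i^2 + cos(3 x_i)),
# whose gradient is 2x - 3 sin(3x).
x_min = adam(lambda x: 2 * x - 3 * np.sin(3 * x), x0=np.full(4, 2.0))
```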
This list is automatically generated from the titles and abstracts of the papers on this site.