On Convergence of Gradient Descent Ascent: A Tight Local Analysis
- URL: http://arxiv.org/abs/2207.00957v1
- Date: Sun, 3 Jul 2022 05:04:46 GMT
- Title: On Convergence of Gradient Descent Ascent: A Tight Local Analysis
- Authors: Haochuan Li, Farzan Farnia, Subhro Das, Ali Jadbabaie
- Abstract summary: Gradient Descent Ascent (GDA) methods are the mainstream algorithms for minimax optimization in generative adversarial networks (GANs).
We prove that a stepsize ratio of $\eta_{\mathbf{y}}/\eta_{\mathbf{x}}=\Theta(\kappa)$ is necessary and sufficient for local convergence of GDA, improving on the $\Theta(\kappa^2)$ ratio of (Lin et al., 2020).
We further extend the convergence guarantees to stochastic GDA and extra-gradient methods (EG).
- Score: 30.206431787797776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gradient Descent Ascent (GDA) methods are the mainstream algorithms for
minimax optimization in generative adversarial networks (GANs). Convergence
properties of GDA have drawn significant interest in the recent literature.
Specifically, for $\min_{\mathbf{x}} \max_{\mathbf{y}}
f(\mathbf{x};\mathbf{y})$ where $f$ is strongly-concave in $\mathbf{y}$ and
possibly nonconvex in $\mathbf{x}$, (Lin et al., 2020) proved the convergence
of GDA with a stepsize ratio
$\eta_{\mathbf{y}}/\eta_{\mathbf{x}}=\Theta(\kappa^2)$ where
$\eta_{\mathbf{x}}$ and $\eta_{\mathbf{y}}$ are the stepsizes for $\mathbf{x}$
and $\mathbf{y}$ and $\kappa$ is the condition number for $\mathbf{y}$. While
this stepsize ratio suggests a slow training of the min player, practical GAN
algorithms typically adopt similar stepsizes for both variables, indicating a
wide gap between theoretical and empirical results. In this paper, we aim to
bridge this gap by analyzing the \emph{local convergence} of general
\emph{nonconvex-nonconcave} minimax problems. We demonstrate that a stepsize
ratio of $\Theta(\kappa)$ is necessary and sufficient for local convergence of
GDA to a Stackelberg Equilibrium, where $\kappa$ is the local condition number
for $\mathbf{y}$. We prove a nearly tight convergence rate with a matching
lower bound. We further extend the convergence guarantees to stochastic GDA and
extra-gradient methods (EG). Finally, we conduct several numerical experiments
to support our theoretical findings.
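A minimal sketch, not taken from the paper: the snippet below illustrates the two-timescale GDA update and the extra-gradient (EG) update discussed above on an assumed toy quadratic minimax objective, using the stepsize ratio $\eta_{\mathbf{y}}/\eta_{\mathbf{x}}=\kappa$ suggested by the local analysis. The objective $f$, the constants $a, b, c$, the smoothness estimate $L$, and the stepsize scaling are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch (assumptions, not the paper's code): two-timescale GDA
# and extra-gradient (EG) on a toy quadratic minimax problem
#   min_x max_y f(x, y) = 0.5*a*x^2 + b*x*y - 0.5*c*y^2,
# which is strongly concave in y. The local condition number for y is taken
# as kappa = L / c, with L a crude smoothness estimate for the quadratic.

a, b, c = 0.1, 1.0, 0.25
L = max(a, b, c)          # crude smoothness constant of the quadratic
kappa = L / c             # local condition number for the y-player

def grad_x(x, y):
    return a * x + b * y  # partial derivative of f with respect to x

def grad_y(x, y):
    return b * x - c * y  # partial derivative of f with respect to y

def gda(x, y, eta_x, eta_y, iters=5000):
    """Simultaneous GDA: x takes a descent step, y takes an ascent step."""
    for _ in range(iters):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x, y = x - eta_x * gx, y + eta_y * gy
    return x, y

def extragradient(x, y, eta_x, eta_y, iters=5000):
    """EG: a look-ahead step, then an update using the look-ahead gradients."""
    for _ in range(iters):
        xm = x - eta_x * grad_x(x, y)   # look-ahead point
        ym = y + eta_y * grad_y(x, y)
        x = x - eta_x * grad_x(xm, ym)  # actual update at the look-ahead point
        y = y + eta_y * grad_y(xm, ym)
    return x, y

if __name__ == "__main__":
    eta_x = 0.1 / (kappa * L)  # assumed scaling; the paper's constants differ
    eta_y = kappa * eta_x      # stepsize ratio eta_y / eta_x = kappa
    print("GDA:", gda(2.0, 2.0, eta_x, eta_y))
    print("EG :", extragradient(2.0, 2.0, eta_x, eta_y))
```

On this toy problem both methods contract toward the equilibrium at the origin; the paper's actual results concern general nonconvex-nonconcave objectives, Stackelberg equilibria, and stochastic gradients, which this sketch does not cover.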
Related papers
- Improved convergence rate of kNN graph Laplacians [11.93971616098517]
A general class of $k$NN graphs is considered, where the graph affinity $W_{ij}$ scales as $\epsilon^{-d/2}$.
We prove the point-wise convergence of the $k$NN graph Laplacian to the limiting manifold operator.
arXiv Detail & Related papers (2024-10-30T17:01:00Z) - Two-Timescale Gradient Descent Ascent Algorithms for Nonconvex Minimax Optimization [77.3396841985172]
We provide a unified analysis of two-timescale gradient descent ascent (TTGDA) for solving structured nonconvex minimax optimization problems.
Our contribution is to design TTGDA algorithms that are effective beyond the convex-concave setting.
arXiv Detail & Related papers (2024-08-21T20:14:54Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that set it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Near Optimal Heteroscedastic Regression with Symbiotic Learning [29.16456701187538]
We consider the problem of heteroscedastic linear regression.
We can estimate $\mathbf{w}^{*}$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^{2}\cdot\left(\frac{1}{n}+\left(\frac{d}{n}\right)^{2}\right)\right)$ and prove a matching lower bound.
arXiv Detail & Related papers (2023-06-25T16:32:00Z) - Convergence of a Normal Map-based Prox-SGD Method under the KL
Inequality [0.0]
We present a novel normal map-based algorithm ($\mathsf{norM}\text{-}\mathsf{SGD}$) and analyze its convergence under the KL inequality.
arXiv Detail & Related papers (2023-05-10T01:12:11Z) - Randomized Block-Coordinate Optimistic Gradient Algorithms for
Root-Finding Problems [8.0153031008486]
We develop two new algorithms to approximate a solution of nonlinear equations in large-scale settings.
We apply our methods to a class of large-scale finite-sum inclusions, which covers prominent applications in machine learning.
arXiv Detail & Related papers (2023-01-08T21:46:27Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x}\mapsto\sigma(\mathbf{w}\cdot\mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w})=C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Fast Graph Sampling for Short Video Summarization using Gershgorin Disc
Alignment [52.577757919003844]
We study the problem of efficiently summarizing a short video into several paragraphs, leveraging recent progress in fast graph sampling.
Experimental results show that our algorithm achieves comparable video summarization as state-of-the-art methods, at a substantially reduced complexity.
arXiv Detail & Related papers (2021-10-21T18:43:00Z) - Near-Optimal Algorithms for Minimax Optimization [115.21519161773287]
The paper presents the first algorithm with gradient complexity $\tilde{O}(\sqrt{\kappa_{\mathbf{x}}\kappa_{\mathbf{y}}})$, matching the lower bound up to logarithmic factors.
The paper also presents algorithms that match or outperform all existing methods in these settings in terms of complexity.
arXiv Detail & Related papers (2020-02-05T16:49:09Z) - On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems [86.92205445270427]
We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y}} f(\mathbf{x},\mathbf{y})$, and study how to solve them efficiently.
arXiv Detail & Related papers (2019-06-02T03:03:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.