Negative Stepsizes Make Gradient-Descent-Ascent Converge
- URL: http://arxiv.org/abs/2505.01423v1
- Date: Fri, 02 May 2025 17:59:46 GMT
- Title: Negative Stepsizes Make Gradient-Descent-Ascent Converge
- Authors: Henry Shugart, Jason M. Altschuler,
- Abstract summary: We show that gradient-descent-ascent (GDA) converges in its original form by simply using a judicious choice of stepsizes.<n>We interpret this as a second-order finite-differencing algorithm and show that, intriguingly, it approximately implements consensus optimization.
- Score: 5.448070998907116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficient computation of min-max problems is a central question in optimization, learning, games, and controls. Arguably the most natural algorithm is gradient-descent-ascent (GDA). However, since the 1970s, conventional wisdom has argued that GDA fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with additional building blocks such as extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form by simply using a judicious choice of stepsizes. The key innovation is the proposal of unconventional stepsize schedules (dubbed slingshot stepsize schedules) that are time-varying, asymmetric, and periodically negative. We show that all three properties are necessary for convergence, and that altogether this enables GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). All of our results apply to the last iterate of GDA, as is typically desired in practice. The core algorithmic intuition is that although negative stepsizes make backward progress, they de-synchronize the min and max variables (overcoming the cycling issue of GDA), and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger. This results in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive/negative steps cancel to first order, yielding a second-order net movement in a new direction that leads to convergence and is otherwise impossible for GDA to move in. We interpret this as a second-order finite-differencing algorithm and show that, intriguingly, it approximately implements consensus optimization, an empirically popular algorithm for min-max problems involving deep neural networks (e.g., training GANs).
Related papers
- Fundamental Benefit of Alternating Updates in Minimax Optimization [15.170534286366744]
The Gradient Descent-Ascent gradients (GDA) algorithm is designed to solve minimax optimization problems.
While Alt-GDA is commonly observed to converge iteration faster, the performance gap between the two is not yet well understood.
We present fine-grained convergence analyses of both algorithms for strongly-strongly-concave and Lipschitz-gradient objectives.
arXiv Detail & Related papers (2024-02-16T06:41:35Z) - Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of sign.
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z) - Differentiable Bilevel Programming for Stackelberg Congestion Games [47.60156422249365]
In a Stackelberg congestion game (SCG), a leader aims to maximize their own gain by anticipating and manipulating the equilibrium state at which the followers settle by playing a congestion game.
Here, we attempt to tackle this computational challenge by marrying traditional methodologies with the latest differentiable programming techniques in machine learning.
We propose two new local search algorithms for SCGs. The first is a gradient descent algorithm that obtains the derivatives by unrolling ILD via differentiable programming.
The second algorithm adds a twist by cutting short the followers' evolution trajectory.
arXiv Detail & Related papers (2022-09-15T21:32:23Z) - Randomized Stochastic Gradient Descent Ascent [37.887266927498395]
An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function.
We propose RSGDA (Randomized SGD), a variant of ESGDA with loop size with a simpler theoretical analysis.
arXiv Detail & Related papers (2021-11-25T16:44:19Z) - Escaping Saddle Points in Nonconvex Minimax Optimization via
Cubic-Regularized Gradient Descent-Ascent [13.565010494569243]
The gradient descent-ascent (GDA) algorithm has been widely applied to solve non minimax optimization problems.
We develop the first-type GDA algorithm for escaping points in non-strongly-concave minimax optimization.
arXiv Detail & Related papers (2021-10-14T00:36:44Z) - Solve Minimax Optimization by Anderson Acceleration [5.900781483345349]
Gradient descent ascent (GDA) is the most commonly used algorithm due to its simplicity.
We propose a new minimax optimization framework, GDA-AM, that views the GDAdynamics as a fixed-point iteration.
We show theoretically that the algorithm can achieve global convergence for bilinear problems under mild conditions.
arXiv Detail & Related papers (2021-10-06T02:08:54Z) - An Accelerated Variance-Reduced Conditional Gradient Sliding Algorithm
for First-order and Zeroth-order Optimization [111.24899593052851]
Conditional gradient algorithm (also known as the Frank-Wolfe algorithm) has recently regained popularity in the machine learning community.
ARCS is the first zeroth-order conditional gradient sliding type algorithms solving convex problems in zeroth-order optimization.
In first-order optimization, the convergence results of ARCS substantially outperform previous algorithms in terms of the number of gradient query oracle.
arXiv Detail & Related papers (2021-09-18T07:08:11Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous gradient descent (DASGD)
We show that DASGD converges to a global optimal implementation model under same delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-critical optimization that finds an $O(epsilon)$epsilon point in the optimal product.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Proximal Gradient Descent-Ascent: Variable Convergence under K{\L}
Geometry [49.65455534654459]
The finite descent-ascent parameters (GDA) has been widely applied to solve minimax optimization problems.
This paper fills such a gap by studying the convergence of the KL-Lized geometry.
arXiv Detail & Related papers (2021-02-09T05:35:53Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Global Convergence and Variance-Reduced Optimization for a Class of
Nonconvex-Nonconcave Minimax Problems [39.13924898972611]
Non minimaxewicz problems appear frequently in emerging machine learning applications generative adversarial networks and adversarial learning.
GDA algorithms with constant size can potentially diverge even in the convex setting.
AGDA algorithm converges globally at a rate that attains a sub rate.
arXiv Detail & Related papers (2020-02-22T04:20:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.