On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization
- URL: http://arxiv.org/abs/2109.14213v1
- Date: Wed, 29 Sep 2021 06:38:39 GMT
- Title: On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization
- Authors: Zehao Dou, Yuanzhi Li
- Abstract summary: We show that Adam-type algorithms converge to one-sided first-order stationary points in min-max optimization problems under a one-sided MVI condition.
We also empirically verify that this one-sided MVI condition is satisfied for standard GANs after training on standard datasets.
- Score: 43.504548777955854
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Adam-type methods, the extension of adaptive gradient methods, have shown
great performance in the training of both supervised and unsupervised machine
learning models. In particular, Adam-type optimizers have been widely used
empirically as the default tool for training generative adversarial networks
(GANs). On the theory side, however, despite theoretical results showing the
efficiency of Adam-type methods in minimization problems, an explanation of their
strong performance in GAN training is still missing. In existing works, fast
convergence has long been considered one of the most important reasons, and
multiple works give theoretical guarantees that min-max optimization algorithms
converge to a critical point under certain assumptions. In this paper, we first
argue empirically that in GAN training, Adam does not converge to a critical
point even upon successful training: only the generator converges, while the
discriminator's gradient norm remains high throughout training. We name this
one-sided convergence. We then bridge the gap between experiments and theory by
showing that Adam-type algorithms provably converge to one-sided first-order
stationary points in min-max optimization problems under a one-sided MVI
condition. We also empirically verify that this one-sided MVI condition is
satisfied for standard GANs after training on standard datasets. To the best of
our knowledge, this is the first result that provides both an empirical
observation and a rigorous theoretical guarantee of the one-sided convergence of
Adam-type algorithms in min-max optimization.
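To make the abstract's central observation concrete, here is a minimal, illustrative sketch (not the authors' code): it applies plain Adam updates simultaneously to both players of a toy non-convex non-concave min-max problem and logs each player's gradient norm separately. The toy objective, the hand-rolled Adam class, and all hyperparameters below are assumptions made for illustration only; under one-sided convergence one would expect the min player's gradient norm to shrink while the max player's can remain large, mirroring the generator/discriminator behaviour described above.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    x = rng.standard_normal(d)   # min player ("generator")
    y = rng.standard_normal(d)   # max player ("discriminator")

    # Toy objective: f(x, y) = x.y + a*sum(sin(x)) - a*sum(cos(y))  (illustrative only)
    a = 0.1
    def grad_x(x, y):
        return y + a * np.cos(x)   # gradient of f w.r.t. the min player
    def grad_y(x, y):
        return x + a * np.sin(y)   # gradient of f w.r.t. the max player

    class Adam:
        """Plain Adam update rule (exponential moving averages of g and g**2)."""
        def __init__(self, dim, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
            self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
            self.m, self.v, self.t = np.zeros(dim), np.zeros(dim), 0
        def step(self, theta, g):
            self.t += 1
            self.m = self.b1 * self.m + (1 - self.b1) * g
            self.v = self.b2 * self.v + (1 - self.b2) * g**2
            mhat = self.m / (1 - self.b1 ** self.t)   # bias correction
            vhat = self.v / (1 - self.b2 ** self.t)
            return theta - self.lr * mhat / (np.sqrt(vhat) + self.eps)

    opt_x, opt_y = Adam(d), Adam(d)
    for step in range(20_000):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = opt_x.step(x, gx)     # min player descends on f
        y = opt_y.step(y, -gy)    # max player ascends on f (descends on -f)
        if step % 2000 == 0:
            # One-sided convergence would show ||grad_x|| shrinking while ||grad_y|| stays large.
            print(f"step {step:6d}  ||grad_x|| = {np.linalg.norm(gx):.4f}"
                  f"  ||grad_y|| = {np.linalg.norm(gy):.4f}")

As a rough, schematic proxy for the paper's one-sided MVI (Minty variational inequality) condition, one could additionally track an inner product of the form <grad_x f(x_t, y_t), x_t - x_ref> along the trajectory for some reference point x_ref; the paper itself should be consulted for the precise statement of the condition.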
Related papers
- Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters
and Non-ergodic Case [0.0]
This paper focuses on exploring the convergence of vanilla Adam and the challenges of non-ergodic convergence.
These findings build a solid theoretical foundation for Adam to solve non-ergodic optimization problems.
arXiv Detail & Related papers (2023-07-20T12:02:17Z)
- Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
- Can Decentralized Stochastic Minimax Optimization Algorithms Converge Linearly for Finite-Sum Nonconvex-Nonconcave Problems? [56.62372517641597]
Decentralized minimax optimization has been actively studied in the past few years due to its applications in a wide range of machine learning problems.
This paper develops two novel decentralized minimax optimization algorithms for finite-sum nonconvex-nonconcave problems.
arXiv Detail & Related papers (2023-04-24T02:19:39Z)
- A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z)
- A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is simple and generic, and it can be leveraged to establish convergence for a broader family of non-convex and compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to help Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition that depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)
- Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex Optimization [15.731908248435348]
We develop gradient descent averaging and primal-dual averaging algorithms for strongly convex cases.
We prove that primal-dual averaging yields the optimal convergence rate in terms of output averaging, while SC-PDA achieves the optimal individual convergence.
Several experiments on SVMs and deep learning models validate the correctness of the theoretical analysis and the effectiveness of the algorithms.
arXiv Detail & Related papers (2020-12-29T01:40:30Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds in theory.
Experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.