On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization
- URL: http://arxiv.org/abs/2109.14213v1
- Date: Wed, 29 Sep 2021 06:38:39 GMT
- Title: On the One-sided Convergence of Adam-type Algorithms in Non-convex
Non-concave Min-max Optimization
- Authors: Zehao Dou, Yuanzhi Li
- Abstract summary: We show that Adam-type algorithms converge to one-sided first-order stationary points in min-max optimization problems under a one-sided MVI condition.
We also empirically verify that this one-sided MVI condition is satisfied for standard GANs after training on standard datasets.
- Score: 43.504548777955854
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Adam-type methods, the extension of adaptive gradient methods, have shown
great performance in the training of both supervised and unsupervised machine
learning models. In particular, Adam-type optimizers have been widely used
empirically as the default tool for training generative adversarial networks
(GANs). On the theory side, however, despite theoretical results showing the
efficiency of Adam-type methods in minimization problems, an explanation of their
strong performance in GAN training is still missing. In existing works, fast
convergence has long been considered one of the most important reasons, and
multiple works give theoretical guarantees that min-max optimization algorithms
converge to a critical point under certain assumptions. In this paper, we first
argue empirically that in GAN training, Adam does not converge to a critical
point even upon successful training: only the generator converges, while the
discriminator's gradient norm remains high throughout training. We name this
one-sided convergence. We then bridge the gap between experiments and theory by
showing that Adam-type algorithms provably converge to one-sided first-order
stationary points in min-max optimization problems under a one-sided MVI
condition. We also empirically verify that this one-sided MVI condition is
satisfied for standard GANs after training on standard datasets. To the best of
our knowledge, this is the first result that provides both an empirical
observation and a rigorous theoretical guarantee of the one-sided convergence of
Adam-type algorithms in min-max optimization.
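To make the abstract's central observation concrete, here is a minimal, illustrative sketch (not the authors' code): it applies plain Adam updates simultaneously to both players of a toy non-convex non-concave min-max problem and logs each player's gradient norm separately. The toy objective, the hand-rolled Adam class, and all hyperparameters below are assumptions made for illustration only; under one-sided convergence one would expect the min player's gradient norm to shrink while the max player's can remain large, mirroring the generator/discriminator behaviour described above.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    x = rng.standard_normal(d)   # min player ("generator")
    y = rng.standard_normal(d)   # max player ("discriminator")

    # Toy objective: f(x, y) = x.y + a*sum(sin(x)) - a*sum(cos(y))  (illustrative only)
    a = 0.1
    def grad_x(x, y):
        return y + a * np.cos(x)   # gradient of f w.r.t. the min player
    def grad_y(x, y):
        return x + a * np.sin(y)   # gradient of f w.r.t. the max player

    class Adam:
        """Plain Adam update rule (exponential moving averages of g and g**2)."""
        def __init__(self, dim, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
            self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
            self.m, self.v, self.t = np.zeros(dim), np.zeros(dim), 0
        def step(self, theta, g):
            self.t += 1
            self.m = self.b1 * self.m + (1 - self.b1) * g
            self.v = self.b2 * self.v + (1 - self.b2) * g**2
            mhat = self.m / (1 - self.b1 ** self.t)   # bias correction
            vhat = self.v / (1 - self.b2 ** self.t)
            return theta - self.lr * mhat / (np.sqrt(vhat) + self.eps)

    opt_x, opt_y = Adam(d), Adam(d)
    for step in range(20_000):
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = opt_x.step(x, gx)     # min player descends on f
        y = opt_y.step(y, -gy)    # max player ascends on f (descends on -f)
        if step % 2000 == 0:
            # One-sided convergence would show ||grad_x|| shrinking while ||grad_y|| stays large.
            print(f"step {step:6d}  ||grad_x|| = {np.linalg.norm(gx):.4f}"
                  f"  ||grad_y|| = {np.linalg.norm(gy):.4f}")

As a rough, schematic proxy for the paper's one-sided MVI (Minty variational inequality) condition, one could additionally track an inner product of the form <grad_x f(x_t, y_t), x_t - x_ref> along the trajectory for some reference point x_ref; the paper itself should be consulted for the precise statement of the condition.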
Related papers
- Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters
and Non-ergodic Case [0.0]
This paper focuses on exploring the convergence of vanilla Adam and the challenges of non-ergodic convergence.
These findings build a solid theoretical foundation for Adam to solve non-ergodic optimization problems.
arXiv Detail & Related papers (2023-07-20T12:02:17Z)
- Convergence of Adam Under Relaxed Assumptions [72.24779199744954]
We show that Adam converges to $\epsilon$-stationary points with $O(\epsilon^{-4})$ gradient complexity under far more realistic conditions.
We also propose a variance-reduced version of Adam with an accelerated gradient complexity of $O(\epsilon^{-3})$.
arXiv Detail & Related papers (2023-04-27T06:27:37Z)
- Can Decentralized Stochastic Minimax Optimization Algorithms Converge Linearly for Finite-Sum Nonconvex-Nonconcave Problems? [56.62372517641597]
Decentralized minimax optimization has been actively studied in the past few years due to its applications in a wide range of machine learning problems.
This paper develops two novel decentralized minimax optimization algorithms for finite-sum nonconvex-nonconcave problems.
arXiv Detail & Related papers (2023-04-24T02:19:39Z)
- A Control Theoretic Framework for Adaptive Gradient Optimizers in Machine Learning [0.6526824510982802]
Adaptive gradient methods have become popular in optimizing deep neural networks.
Recent examples include AdaGrad and Adam.
We develop a generic framework for adaptive gradient methods.
arXiv Detail & Related papers (2022-06-04T17:55:33Z)
- A Novel Convergence Analysis for Algorithms of the Adam Family [105.22760323075008]
We present a generic proof of convergence for a family of Adam-style methods including Adam, AMSGrad, Adabound, etc.
Our analysis is simple and generic, and it can be leveraged to establish convergence for a broader family of non-convex and compositional optimization problems.
arXiv Detail & Related papers (2021-12-07T02:47:58Z)
- Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization [118.50301177912381]
We show that Adam can converge to different solutions of the objective with provably different errors, even with weight decay regularization.
We show that if the objective is convex and weight decay regularization is employed, any optimization algorithm, including Adam, will converge to the same solution.
arXiv Detail & Related papers (2021-08-25T17:58:21Z)
- Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration [12.744658958445024]
Adam is one of the most influential adaptive algorithms for training deep neural networks.
Existing approaches, such as decreasing the adaptive learning rate or adopting a large batch size, have been used to help Adam-type algorithms converge.
We introduce an alternative, easy-to-check sufficient condition that depends only on the parameters of the historical base learning rates.
arXiv Detail & Related papers (2021-01-14T06:42:29Z)
- Gradient Descent Averaging and Primal-dual Averaging for Strongly Convex Optimization [15.731908248435348]
We develop gradient descent averaging and primal-dual averaging algorithms for strongly convex cases.
We prove that primal-dual averaging yields the optimal convergence rate in terms of output averaging, while SC-PDA achieves the optimal individual convergence.
Several experiments on SVMs and deep learning models validate the correctness of the theoretical analysis and the effectiveness of the algorithms.
arXiv Detail & Related papers (2020-12-29T01:40:30Z)
- Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks [50.42141893913188]
We study distributed stochastic algorithms for large-scale AUC maximization with a deep neural network as the predictive model.
Our method requires a much smaller number of communication rounds in theory.
Experiments on several datasets demonstrate the effectiveness of our method and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.