Stable Nonconvex-Nonconcave Training via Linear Interpolation
- URL: http://arxiv.org/abs/2310.13459v4
- Date: Thu, 14 Mar 2024 15:52:16 GMT
- Title: Stable Nonconvex-Nonconcave Training via Linear Interpolation
- Authors: Thomas Pethick, Wanyun Xie, Volkan Cevher,
- Abstract summary: This paper presents a theoretical analysis of linearahead as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear can help by leveraging the theory of nonexpansive operators.
- Score: 51.668052890249726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first explicit method without anchoring to achieve last iterate convergence rates for $\rho$-comonotone problems while only requiring $\rho > -\tfrac{1}{2L}$. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $mathcalO( ln(T) / T 1 - frac1alpha ).
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with fixed linear decoder in binary.
We provide the first finite-time convergence rate for $W_t$ to $W_mm$, along with the rate of sparsification in the attention map.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Implicit regularization in AI meets generalized hardness of
approximation in optimization -- Sharp results for diagonal linear networks [0.0]
We show sharp results for the implicit regularization imposed by the gradient flow of Diagonal Linear Networks.
We link this to the phenomenon of phase transitions in generalized hardness of approximation.
Non-sharpness of our results would imply that the GHA phenomenon would not occur for the basis pursuit optimization problem.
arXiv Detail & Related papers (2023-07-13T13:27:51Z) - Implicit Regularization for Group Sparsity [33.487964460794764]
We show that gradient descent over the squared regression loss, without any explicit regularization, biases towards solutions with a group sparsity structure.
We analyze the gradient dynamics of the corresponding regression problem in the general noise setting and obtain minimax-optimal error rates.
In the degenerate case of size-one groups, our approach gives rise to a new algorithm for sparse linear regression.
arXiv Detail & Related papers (2023-01-29T20:54:03Z) - Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model
Classes and Cone Decompositions [41.337814204665364]
We develop algorithms for convex optimization of two-layer neural networks with ReLU activation functions.
We show that convex gated ReLU models obtain data-dependent approximation bounds for the ReLU training problem.
arXiv Detail & Related papers (2022-02-02T23:50:53Z) - On the Explicit Role of Initialization on the Convergence and Implicit
Bias of Overparametrized Linear Networks [1.0323063834827415]
We present a novel analysis of single-hidden-layer linear networks trained under gradient flow.
We show that the squared loss converges exponentially to its optimum.
We derive a novel non-asymptotic upper-bound on the distance between the trained network and the min-norm solution.
arXiv Detail & Related papers (2021-05-13T15:13:51Z) - Leveraging Non-uniformity in First-order Non-convex Optimization [93.6817946818977]
Non-uniform refinement of objective functions leads to emphNon-uniform Smoothness (NS) and emphNon-uniform Lojasiewicz inequality (NL)
New definitions inspire new geometry-aware first-order methods that converge to global optimality faster than the classical $Omega (1/t2)$ lower bounds.
arXiv Detail & Related papers (2021-05-13T04:23:07Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.