AdaLoss: A computationally-efficient and provably convergent adaptive
gradient method
- URL: http://arxiv.org/abs/2109.08282v1
- Date: Fri, 17 Sep 2021 01:45:25 GMT
- Title: AdaLoss: A computationally-efficient and provably convergent adaptive
gradient method
- Authors: Xiaoxia Wu and Yuege Xie and Simon Du and Rachel Ward
- Abstract summary: We propose a computationally-friendly adaptive learning rate schedule, "AdaLoss", which uses the information of the loss function to adjust the stepsize in gradient descent methods.
We numerically verify the theoretical results and extend the scope of the experiments with applications in LSTM models for text classification and policy gradients for control problems.
- Score: 7.856998585396422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a computationally-friendly adaptive learning rate schedule,
"AdaLoss", which directly uses the information of the loss function to adjust
the stepsize in gradient descent methods. We prove that this schedule enjoys
linear convergence in linear regression. Moreover, we provide a linear
convergence guarantee over the non-convex regime, in the context of two-layer
over-parameterized neural networks. If the width of the first-hidden layer in
the two-layer networks is sufficiently large (polynomially), then AdaLoss
converges robustly \emph{to the global minimum} in polynomial time. We
numerically verify the theoretical results and extend the scope of the
numerical experiments by considering applications in LSTM models for text
classification and policy gradients for control problems.
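The abstract does not spell out the update rule, but the idea of driving the stepsize with the loss value admits a minimal sketch for least squares. The accumulator update below (b_{t+1}^2 = b_t^2 + loss(w_t)) and all constants are illustrative assumptions, not the paper's exact schedule.

```python
# Hedged sketch: a loss-driven adaptive stepsize in the spirit of AdaLoss,
# applied to least-squares regression. The accumulator update
# b_{t+1}^2 = b_t^2 + ||Aw_t - y||^2 / n is an assumption for illustration.
import numpy as np

def adaloss_least_squares(A, y, eta=1.0, b0=0.1, steps=500):
    n, d = A.shape
    w = np.zeros(d)
    b_sq = b0 ** 2                               # accumulator b_t^2
    for _ in range(steps):
        residual = A @ w - y
        grad = A.T @ residual / n                # gradient of (1/2n)||Aw - y||^2
        b_sq += residual @ residual / n          # loss-based accumulation (assumed form)
        w -= (eta / np.sqrt(b_sq)) * grad        # stepsize eta / b_t shrinks as loss accumulates
    return w

# Usage on a synthetic problem
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
w_true = rng.standard_normal(10)
y = A @ w_true
print(float(np.linalg.norm(adaloss_least_squares(A, y) - w_true)))
```

The structure mirrors AdaGrad-Norm, except that the accumulator is fed by loss values rather than gradient norms, which is what makes the schedule cheap to evaluate.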
Related papers
- Hybrid Coordinate Descent for Efficient Neural Network Learning Using Line Search and Gradient Descent [3.8936716676293917]
This paper presents a novel coordinate descent algorithm for a squared error loss function.
Each parameter is updated by either a line search or a gradient step.
Its parallelizable structure reduces computation time.
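A minimal sketch of such a hybrid update for a squared-error loss, assuming an exact closed-form line search for some coordinates and a plain gradient step for the others; the rule deciding which coordinates use which update is hypothetical, not the paper's criterion.

```python
# Hedged sketch of hybrid coordinate descent for a squared-error loss:
# per coordinate, either an exact line search (closed form for a quadratic)
# or a plain gradient step. The alternating selection rule is an assumption.
import numpy as np

def hybrid_coordinate_descent(A, y, lr=0.1, sweeps=50):
    n, d = A.shape
    w = np.zeros(d)
    use_line_search = (np.arange(d) % 2 == 0)    # assumed rule: alternate coordinates
    for _ in range(sweeps):
        for j in range(d):
            residual = A @ w - y
            grad_j = A[:, j] @ residual / n
            if use_line_search[j]:
                curvature = A[:, j] @ A[:, j] / n
                w[j] -= grad_j / curvature       # exact minimizer along coordinate j
            else:
                w[j] -= lr * grad_j              # plain gradient step
    return w
```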
arXiv Detail & Related papers (2024-08-02T16:29:54Z)
- Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
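A rough sketch of this setting, assuming server-side AdaGrad applied to client gradients aggregated over a noisy analog channel; the additive-Gaussian noise model and the preconditioner form are assumptions, not the paper's protocol.

```python
# Hedged sketch: AdaGrad-style federated training with over-the-air aggregation,
# modeled here as the mean client gradient plus Gaussian channel noise.
import numpy as np

def ota_federated_adagrad(client_grads_fn, d, rounds=100, eta=0.1,
                          noise_std=0.01, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(d)
    accum = np.zeros(d)                                   # AdaGrad accumulator on the server
    for _ in range(rounds):
        grads = client_grads_fn(w)                        # list of per-client gradients
        aggregate = np.mean(grads, axis=0)
        aggregate += noise_std * rng.standard_normal(d)   # over-the-air channel noise (assumed)
        accum += aggregate ** 2
        w -= eta * aggregate / (np.sqrt(accum) + eps)     # coordinate-wise AdaGrad step
    return w

# Usage: three clients, each with gradient w - target_k of a local quadratic
targets = [np.zeros(5), np.ones(5), 2 * np.ones(5)]
w_final = ota_federated_adagrad(lambda w: [w - t for t in targets], d=5)
```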
arXiv Detail & Related papers (2024-03-11T09:10:37Z)
- Robust Stochastically-Descending Unrolled Networks [85.6993263983062]
Deep unrolling is an emerging learning-to-optimize method that unrolls a truncated iterative algorithm in the layers of a trainable neural network.
However, convergence guarantees and generalizability of the unrolled networks remain open theoretical problems.
We numerically assess unrolled architectures trained under the proposed constraints in two different applications.
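A minimal sketch of the unrolling idea, assuming gradient descent for least squares as the base iteration and per-layer stepsizes as the learned parameters; the paper's base algorithm and parameterization may differ.

```python
# Hedged sketch of deep unrolling: a truncated iterative solver written as a
# fixed-depth network whose per-layer stepsizes are the trainable parameters.
import numpy as np

def unrolled_gradient_descent(A, y, stepsizes):
    """Each 'layer' applies one gradient step with its own learnable stepsize."""
    w = np.zeros(A.shape[1])
    for eta_k in stepsizes:                      # depth = number of unrolled iterations
        grad = A.T @ (A @ w - y) / len(y)
        w = w - eta_k * grad
    return w

# The vector `stepsizes` would be trained end-to-end by differentiating through
# the unrolled computation; here we only run the forward pass.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 8))
y = A @ rng.standard_normal(8)
w_out = unrolled_gradient_descent(A, y, stepsizes=np.full(10, 0.05))
```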
arXiv Detail & Related papers (2023-12-25T18:51:23Z)
- Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
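A minimal sketch of the interpolation step, assuming a plain gradient step as the base operator T and a fixed mixing weight; the paper's actual operator and schedule may differ.

```python
# Hedged sketch of stabilization by linear interpolation: each update is a
# convex combination of the current iterate and the iterate proposed by a base
# optimizer step T (a Krasnosel'skii-Mann-style relaxation).
import numpy as np

def interpolated_updates(grad_fn, w0, lam=0.5, lr=0.1, steps=200):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        proposed = w - lr * grad_fn(w)        # base optimizer step T(w), assumed here
        w = (1.0 - lam) * w + lam * proposed  # linear interpolation toward T(w)
    return w

# Usage on a simple quadratic with grad f(w) = w
w_star = interpolated_updates(lambda w: w, w0=np.ones(3))
```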
arXiv Detail & Related papers (2023-10-20T12:45:12Z)
- Optimization Guarantees of Unfolded ISTA and ADMM Networks With Smooth Soft-Thresholding [57.71603937699949]
We study optimization guarantees, i.e., achieving near-zero training loss as the number of training epochs increases.
We show that the threshold on the number of training samples grows with the network width.
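A minimal sketch of one unfolded ISTA layer with a softplus-smoothed soft-threshold; the specific smoothing and the fixed (non-learned) parameters here are assumptions for illustration.

```python
# Hedged sketch of an unfolded ISTA layer with a smoothed soft-threshold.
# The standard soft-threshold is ReLU(x - lam) - ReLU(-x - lam); here ReLU is
# replaced by a scaled softplus, which recovers soft-thresholding as beta grows.
import numpy as np

def smooth_soft_threshold(x, lam, beta=10.0):
    # numerically stable scaled softplus: max(z, 0) + log(1 + exp(-beta*|z|)) / beta
    softplus = lambda z: np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(beta * z))) / beta
    return softplus(x - lam) - softplus(-x - lam)

def unfolded_ista(A, y, lam=0.1, layers=10):
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the data-fit gradient
    z = np.zeros(A.shape[1])
    for _ in range(layers):                       # each loop iteration = one network layer
        z = smooth_soft_threshold(z - A.T @ (A @ z - y) / L, lam / L)
    return z
```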
arXiv Detail & Related papers (2023-09-12T13:03:47Z)
- Aiming towards the minimizers: fast convergence of SGD for overparametrized problems [25.077446336619378]
We propose a regularity condition which endows the stochastic gradient method with the same worst-case complexity as the deterministic gradient method.
All existing guarantees require the stochastic gradient method to take small steps, thereby resulting in a much slower linear rate of convergence.
We demonstrate that our condition holds when training sufficiently wide feedforward neural networks with a linear output layer.
arXiv Detail & Related papers (2023-06-05T05:21:01Z)
- Constrained Optimization via Exact Augmented Lagrangian and Randomized Iterative Sketching [55.28394191394675]
We develop an adaptive inexact Newton method for equality-constrained nonlinear, nonconvex optimization problems.
We demonstrate the superior performance of our method on benchmark nonlinear problems, constrained logistic regression with data from LIBSVM, and a PDE-constrained problem.
arXiv Detail & Related papers (2023-05-28T06:33:37Z)
- Fast Convergence in Learning Two-Layer Neural Networks with Separable Data [37.908159361149835]
We study normalized gradient descent on two-layer neural nets.
We prove for exponentially-tailed losses that using normalized GD leads to linear rate of convergence of the training loss to the global optimum.
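A minimal sketch of normalized gradient descent on a generic differentiable loss; the paper's analysis is specific to two-layer networks trained with exponentially-tailed losses.

```python
# Hedged sketch of normalized gradient descent: the step direction is the
# gradient divided by its norm, so the step length stays fixed at eta even as
# the gradient shrinks near the optimum.
import numpy as np

def normalized_gd(grad_fn, w0, eta=0.1, steps=1000, tol=1e-12):
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = grad_fn(w)
        norm = np.linalg.norm(g)
        if norm < tol:                           # gradient (numerically) vanished
            break
        w = w - eta * g / norm                   # unit-length step in the gradient direction
    return w
```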
arXiv Detail & Related papers (2023-05-22T20:30:10Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Improved Overparametrization Bounds for Global Convergence of Stochastic Gradient Descent for Shallow Neural Networks [1.14219428942199]
We study the overparametrization bounds required for the global convergence of the stochastic gradient descent algorithm for a class of one-hidden-layer feed-forward neural networks.
arXiv Detail & Related papers (2022-01-28T11:30:06Z)
- The Implicit Bias of Gradient Descent on Separable Data [44.98410310356165]
We show the predictor converges to the direction of the max-margin (hard margin SVM) solution.
This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero.
arXiv Detail & Related papers (2017-10-27T21:47:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences of its use.