The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
- URL: http://arxiv.org/abs/2405.19585v1
- Date: Thu, 30 May 2024 00:27:52 GMT
- Title: The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms
- Authors: Elizabeth Collins-Woodfin, Inbar Seroussi, Begoña García Malaxechebarría, Andrew W. Mackenzie, Elliot Paquette, Courtney Paquette
- Abstract summary: We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems.
We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs.
We investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm on the least squares problem.
- Score: 8.681909776958184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We develop a framework for analyzing the training and learning rate dynamics on a large class of high-dimensional optimization problems, which we call the high line, trained using one-pass stochastic gradient descent (SGD) with adaptive learning rates. We give exact expressions for the risk and learning rate curves in terms of a deterministic solution to a system of ODEs. We then investigate in detail two adaptive learning rates -- an idealized exact line search and AdaGrad-Norm -- on the least squares problem. When the data covariance matrix has strictly positive eigenvalues, this idealized exact line search strategy can exhibit arbitrarily slower convergence when compared to the optimal fixed learning rate with SGD. Moreover we exactly characterize the limiting learning rate (as time goes to infinity) for line search in the setting where the data covariance has only two distinct eigenvalues. For noiseless targets, we further demonstrate that the AdaGrad-Norm learning rate converges to a deterministic constant inversely proportional to the average eigenvalue of the data covariance matrix, and identify a phase transition when the covariance density of eigenvalues follows a power law distribution.
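As a rough illustration of the setup above, here is a minimal simulation sketch (not the authors' code): one-pass SGD on a noiseless high-dimensional least squares problem, run with two adaptive stepsize rules, AdaGrad-Norm and an oracle line search that minimizes the population risk along the stochastic gradient direction. The dimension, spectrum, hyperparameters, and function names below are illustrative assumptions, and the line-search rule is only one possible reading of the idealized exact line search described in the abstract.
```python
# Sketch only: one-pass SGD on noiseless least squares with two adaptive stepsizes.
import numpy as np

rng = np.random.default_rng(0)

d = 500                                        # ambient dimension (illustrative)
eigs = rng.uniform(0.5, 2.0, d)                # strictly positive covariance spectrum
x_star = rng.standard_normal(d) / np.sqrt(d)   # noiseless target (planted signal)

def risk(x):
    # population risk 0.5 * E[(a^T x - a^T x*)^2] = 0.5 (x - x*)^T K (x - x*)
    diff = x - x_star
    return 0.5 * np.sum(eigs * diff**2)

def fresh_sample():
    # one streaming sample a ~ N(0, K) with noiseless label y = a^T x*
    a = np.sqrt(eigs) * rng.standard_normal(d)
    return a, a @ x_star

def sgd_adagrad_norm(steps=20_000, eta=1.0, b0=1.0):
    """One-pass SGD with AdaGrad-Norm stepsize eta / sqrt(b0^2 + sum_k ||g_k||^2)."""
    x = np.zeros(d)
    g2_sum = b0 ** 2
    risks, lrs = [], []
    for _ in range(steps):
        a, y = fresh_sample()
        g = (a @ x - y) * a                    # gradient of 0.5 * (a^T x - y)^2
        g2_sum += g @ g
        lr = eta / np.sqrt(g2_sum)
        x -= lr * g
        risks.append(risk(x)); lrs.append(lr)
    return np.array(risks), np.array(lrs)

def sgd_oracle_line_search(steps=20_000):
    """One-pass SGD where each stepsize minimizes the population risk along -g."""
    x = np.zeros(d)
    risks, lrs = [], []
    for _ in range(steps):
        a, y = fresh_sample()
        g = (a @ x - y) * a
        Kg = eigs * g                          # K g for the diagonal covariance K
        denom = g @ Kg
        t = (Kg @ (x - x_star)) / denom if denom > 0 else 0.0
        x -= t * g
        risks.append(risk(x)); lrs.append(t)
    return np.array(risks), np.array(lrs)

risk_ada, lr_ada = sgd_adagrad_norm()
risk_ls, lr_ls = sgd_oracle_line_search()
print(f"AdaGrad-Norm:       final risk {risk_ada[-1]:.3e}, final stepsize {lr_ada[-1]:.3e}")
print(f"Oracle line search: final risk {risk_ls[-1]:.3e}, final stepsize {lr_ls[-1]:.3e}")
```
In this noiseless setting the AdaGrad-Norm stepsize should settle near a constant (the abstract identifies the limit as inversely proportional to the average eigenvalue of the covariance), while the line-search stepsize keeps adapting to each sampled direction; plotting `lr_ada` and `lr_ls` against the iteration count gives an empirical analogue of the deterministic learning rate curves described above.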
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level optimizations (BLO)
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization problem, where the inner loss function becomes a smooth distribution, and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z)
- Improving Adaptive Online Learning Using Refined Discretization [44.646191058243645]
We study unconstrained Online Linear Optimization with Lipschitz losses.
Motivated by the pursuit of instance optimality, we propose a new algorithm.
Central to these results is a continuous time approach to online learning.
arXiv Detail & Related papers (2023-09-27T21:54:52Z)
- Constrained Optimization via Exact Augmented Lagrangian and Randomized Iterative Sketching [55.28394191394675]
We develop an adaptive inexact Newton method for equality-constrained nonlinear, nonconvex optimization problems.
We demonstrate the superior performance of our method on benchmark nonlinear problems, constrained logistic regression with data from LIBSVM, and a PDE-constrained problem.
arXiv Detail & Related papers (2023-05-28T06:33:37Z)
- Efficient and Near-Optimal Smoothed Online Learning for Generalized Linear Functions [28.30744223973527]
We give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/sigma) regret for realizable K-wise linear classification.
We develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers.
arXiv Detail & Related papers (2022-05-25T21:31:36Z)
- Data-heterogeneity-aware Mixing for Decentralized Learning [63.83913592085953]
We characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes.
We propose a metric that quantifies the ability of a graph to mix the current gradients.
Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric.
arXiv Detail & Related papers (2022-04-13T15:54:35Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Adaptive and Oblivious Randomized Subspace Methods for High-Dimensional Optimization: Sharp Analysis and Lower Bounds [37.03247707259297]
A suitable adaptive subspace can be generated by sampling a correlated random matrix whose second order statistics mirror the input data.
We show that the relative error of the randomized approximations can be tightly characterized in terms of the spectrum of the data matrix.
Experimental results show that the proposed approach enables significant speed ups in a wide variety of machine learning and optimization problems.
arXiv Detail & Related papers (2020-12-13T13:02:31Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- GTAdam: Gradient Tracking with Adaptive Momentum for Distributed Online Optimization [4.103281325880475]
This paper deals with a network of computing agents aiming to solve an online optimization problem in a distributed fashion, by means of local computation and communication, without any central coordinator.
We propose the gradient tracking with adaptive momentum estimation (GTAdam) distributed algorithm, which combines a gradient tracking mechanism with first and second order momentum estimates of the gradient.
In numerical experiments from multi-agent learning, GTAdam outperforms state-of-the-art distributed optimization methods.
arXiv Detail & Related papers (2020-09-03T15:20:21Z)
- To Each Optimizer a Norm, To Each Norm its Generalization [31.682969645989512]
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes.
We argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization.
arXiv Detail & Related papers (2020-06-11T21:07:38Z)
- On Learning Rates and Schrödinger Operators [105.32118775014015]
We present a general theoretical analysis of the effect of the learning rate.
We find that the learning rate tends to zero for a broad class of non-neural functions.
arXiv Detail & Related papers (2020-04-15T09:52:37Z)