Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability
- URL: http://arxiv.org/abs/2305.11788v2
- Date: Sun, 15 Oct 2023 17:53:26 GMT
- Title: Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability
- Authors: Jingfeng Wu, Vladimir Braverman, Jason D. Lee
- Abstract summary: In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
- Score: 69.01076284478151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has observed that in machine learning optimization, gradient
descent (GD) often operates at the edge of stability (EoS) [Cohen, et al.,
2021], where the stepsizes are set to be large, resulting in non-monotonic
losses induced by the GD iterates. This paper studies the convergence and
implicit bias of constant-stepsize GD for logistic regression on linearly
separable data in the EoS regime. Despite the presence of local oscillations,
we prove that the logistic loss can be minimized by GD with \emph{any} constant
stepsize over a long time scale. Furthermore, we prove that with \emph{any}
constant stepsize, the GD iterates tend to infinity when projected to a
max-margin direction (the hard-margin SVM direction) and converge to a fixed
vector that minimizes a strongly convex potential when projected to the
orthogonal complement of the max-margin direction. In contrast, we also show
that in the EoS regime, GD iterates may diverge catastrophically under the
exponential loss, highlighting the superiority of the logistic loss. These
theoretical findings are in line with numerical simulations and complement
existing theories on the convergence and implicit bias of GD for logistic
regression, which are only applicable when the stepsizes are sufficiently
small.
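A minimal numerical sketch of the setting described above (not code from the paper): constant-stepsize GD on the logistic loss for a two-point linearly separable toy dataset whose hard-margin SVM (max-margin) direction is $e_1 = (1, 0)$ by construction. With a stepsize far above the stability threshold, the printed losses are non-monotone early on yet eventually driven toward zero, the component of the iterate along the max-margin direction keeps growing, and the orthogonal component settles down. The dataset, stepsize, and iteration count are illustrative choices.

```python
import numpy as np

# Toy linearly separable data with labels folded in: z_i = y_i * x_i.
# For Z below, the hard-margin SVM (max-margin) direction is e1 = (1, 0)
# with margin 0.2, by construction.
Z = np.array([[0.2, 2.0],
              [0.2, -1.0]])
n = Z.shape[0]

def logistic_loss(w):
    # (1/n) * sum_i log(1 + exp(-<w, z_i>)), computed stably.
    return np.logaddexp(0.0, -Z @ w).mean()

def gradient(w):
    # grad = -(1/n) * sum_i sigmoid(-<w, z_i>) * z_i, with a stable sigmoid.
    m = Z @ w
    s = np.empty_like(m)
    pos = m >= 0
    s[pos] = np.exp(-m[pos]) / (1.0 + np.exp(-m[pos]))   # sigmoid(-m) for m >= 0
    s[~pos] = 1.0 / (1.0 + np.exp(m[~pos]))              # sigmoid(-m) for m < 0
    return -(Z.T @ s) / n

eta = 20.0          # constant stepsize, far above the stability threshold at w = 0
w = np.zeros(2)
for t in range(2001):
    if t <= 10 or t % 400 == 0:
        print(f"t={t:4d}  loss={logistic_loss(w):.4f}  "
              f"<w, e1>={w[0]:.2f}  <w, e2>={w[1]:.2f}")
    w = w - eta * gradient(w)
```

Swapping the logistic loss for the exponential loss (np.exp(-Z @ w).mean() and its gradient) in the same loop is one way to probe the contrast drawn in the abstract, where large-stepsize GD under the exponential loss may instead blow up.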
Related papers
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
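As a rough, hedged check of the two-phase picture (oscillatory for about $\mathcal{O}(\eta)$ steps, then roughly $\tilde{\mathcal{O}}(1/(\eta t))$ decay), the sketch below reruns constant-stepsize GD on a toy separable logistic-regression problem for several stepsizes and reports the last step at which the loss increased together with the final loss. The dataset, stepsizes, and budget are arbitrary illustrative choices, and the constants in the cited rates are not reproduced here.

```python
import numpy as np

# Toy separable data with labels folded in (z_i = y_i * x_i).
Z = np.array([[0.2, 2.0],
              [0.2, -1.0]])

def loss(w):
    return np.logaddexp(0.0, -Z @ w).mean()

def grad(w):
    m = Z @ w
    # sigmoid(-m) written so that exp() is only applied to -|m| (overflow-free).
    s = np.where(m >= 0,
                 np.exp(-np.abs(m)) / (1.0 + np.exp(-np.abs(m))),
                 1.0 / (1.0 + np.exp(-np.abs(m))))
    return -(Z.T @ s) / Z.shape[0]

T = 4000                       # step budget
for eta in (5.0, 20.0, 80.0):  # increasingly aggressive constant stepsizes
    w = np.zeros(2)
    losses = [loss(w)]
    for _ in range(T):
        w = w - eta * grad(w)
        losses.append(loss(w))
    losses = np.array(losses)
    # Last step at which the loss went up: a crude proxy for the end of the
    # oscillatory phase.
    increases = np.where(np.diff(losses) > 0)[0]
    last_up = int(increases[-1]) + 1 if increases.size else 0
    print(f"eta={eta:5.1f}  last loss increase at step {last_up:5d}  "
          f"final loss={losses[-1]:.3e}")
```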
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
- Max-affine regression via first-order methods [7.12511675782289]
The max-affine model ubiquitously arises in applications in signal processing and statistics.
We present a non-asymptotic convergence analysis of gradient descent (GD) and mini-batch stochastic gradient descent (SGD) for max-affine regression.
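The max-affine model predicts via a maximum of affine functions, $y \approx \max_j (\langle a_j, x\rangle + b_j)$. Below is a minimal, hedged sketch of fitting such a model with full-batch (sub)gradient descent on synthetic data: for each sample, only the affine piece attaining the maximum receives gradient. The data generation, random initialization, stepsize, and iteration count are illustrative choices, not the initialization or conditions analyzed in the paper above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a ground-truth max-affine model with k pieces.
n, d, k = 500, 2, 3
X = rng.normal(size=(n, d))
A_true = rng.normal(size=(k, d))
b_true = rng.normal(size=k)
y = (X @ A_true.T + b_true).max(axis=1) + 0.05 * rng.normal(size=n)

# Randomly initialized parameters to be fitted.
A = rng.normal(size=(k, d))
b = rng.normal(size=k)

lr = 0.2
for step in range(3001):
    P = X @ A.T + b            # all affine pieces, shape (n, k)
    j_star = P.argmax(axis=1)  # active piece for each sample
    y_hat = P[np.arange(n), j_star]
    resid = y_hat - y
    if step % 500 == 0:
        print(f"step {step:4d}  mse={np.mean(resid ** 2):.4f}")
    # Subgradient of the squared loss: each sample updates only its active piece.
    M = np.zeros((n, k))
    M[np.arange(n), j_star] = 1.0
    grad_A = (M * resid[:, None]).T @ X / n
    grad_b = (M * resid[:, None]).sum(axis=0) / n
    A -= lr * grad_A
    b -= lr * grad_b
```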
arXiv Detail & Related papers (2023-08-15T23:46:44Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
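One way to see why two-step (rather than one-step) updates are a natural object of analysis beyond the edge of stability is a 1-D toy loss: with a stepsize above 2 divided by the curvature at the minimum, single GD iterates never settle, but they approach a stable period-2 orbit, so the composition of two GD steps does converge. The loss below, $f(x) = \tfrac{1}{2}\log(1 + x^2)$, is my own illustrative choice, not necessarily an example from the paper.

```python
import numpy as np

# f(x) = 0.5 * log(1 + x^2) has curvature 1 at its minimum x = 0,
# so plain GD is unstable at x = 0 whenever eta > 2.
def gd_step(x, eta):
    return x - eta * x / (1.0 + x ** 2)

eta = 3.0
# For this loss and 2 < eta < 4, GD has a stable period-2 orbit at
# x = +/- sqrt(eta / 2 - 1); the two-step map converges to these points.
x_star = np.sqrt(eta / 2.0 - 1.0)

x = 0.7
for t in range(21):
    # The sign of x alternates every step, while |x| approaches x_star.
    print(f"t={t:2d}  x={x:+.6f}  |x| - x*={abs(x) - x_star:+.2e}")
    x = gd_step(x, eta)
```

For this particular toy loss the period-2 orbit itself loses stability once eta exceeds 4, which is one reason a two-step analysis is only a first step beyond EoS.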
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent [4.511923587827301]
Gradient descent (GD) can achieve improved generalization when its dynamics exhibits chaotic behavior.
In this study, we incorporate a chaotic component into GD in a controlled manner and introduce multiscale perturbed GD (MPGD).
MPGD is a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system.
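As a purely illustrative sketch of that general recipe (not the paper's MPGD construction, scalings, or analysis), the snippet below augments each GD step on a simple least-squares objective with a small perturbation driven by an independent chaotic system, here a logistic map. The objective, coupling strength, and stepsize are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simple least-squares objective: L(w) = 0.5 * mean((X w - y)^2).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w):
    return X.T @ (X @ w - y) / n

eta = 0.1    # GD stepsize
eps = 0.01   # strength of the chaotic perturbation (illustrative)
w = np.zeros(d)
s = rng.uniform(0.1, 0.9, size=d)   # state of the independent chaotic system

for t in range(2001):
    if t % 400 == 0:
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        print(f"t={t:4d}  loss={loss:.5f}")
    s = 4.0 * s * (1.0 - s)                         # logistic map, chaotic regime
    w = w - eta * grad(w) + eps * (2.0 * s - 1.0)   # GD step + centered chaotic kick
```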
arXiv Detail & Related papers (2022-05-23T14:47:55Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Empirically, inductive biases are central to preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.