Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability
- URL: http://arxiv.org/abs/2305.11788v2
- Date: Sun, 15 Oct 2023 17:53:26 GMT
- Title: Implicit Bias of Gradient Descent for Logistic Regression at the Edge of
Stability
- Authors: Jingfeng Wu, Vladimir Braverman, Jason D. Lee
- Abstract summary: In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
- Score: 69.01076284478151
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has observed that in machine learning optimization, gradient
descent (GD) often operates at the edge of stability (EoS) [Cohen, et al.,
2021], where the stepsizes are set to be large, resulting in non-monotonic
losses induced by the GD iterates. This paper studies the convergence and
implicit bias of constant-stepsize GD for logistic regression on linearly
separable data in the EoS regime. Despite the presence of local oscillations,
we prove that the logistic loss can be minimized by GD with \emph{any} constant
stepsize over a long time scale. Furthermore, we prove that with \emph{any}
constant stepsize, the GD iterates tend to infinity when projected to a
max-margin direction (the hard-margin SVM direction) and converge to a fixed
vector that minimizes a strongly convex potential when projected to the
orthogonal complement of the max-margin direction. In contrast, we also show
that in the EoS regime, GD iterates may diverge catastrophically under the
exponential loss, highlighting the superiority of the logistic loss. These
theoretical findings are in line with numerical simulations and complement
existing theories on the convergence and implicit bias of GD for logistic
regression, which are only applicable when the stepsizes are sufficiently
small.
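A minimal numerical sketch of the setting described above (not code from the paper): constant-stepsize GD on the logistic loss for a two-point linearly separable toy dataset whose hard-margin SVM (max-margin) direction is $e_1 = (1, 0)$ by construction. With a stepsize far above the stability threshold, the printed losses are non-monotone early on yet eventually driven toward zero, the component of the iterate along the max-margin direction keeps growing, and the orthogonal component settles down. The dataset, stepsize, and iteration count are illustrative choices.

```python
import numpy as np

# Toy linearly separable data with labels folded in: z_i = y_i * x_i.
# For Z below, the hard-margin SVM (max-margin) direction is e1 = (1, 0)
# with margin 0.2, by construction.
Z = np.array([[0.2, 2.0],
              [0.2, -1.0]])
n = Z.shape[0]

def logistic_loss(w):
    # (1/n) * sum_i log(1 + exp(-<w, z_i>)), computed stably.
    return np.logaddexp(0.0, -Z @ w).mean()

def gradient(w):
    # grad = -(1/n) * sum_i sigmoid(-<w, z_i>) * z_i, with a stable sigmoid.
    m = Z @ w
    s = np.empty_like(m)
    pos = m >= 0
    s[pos] = np.exp(-m[pos]) / (1.0 + np.exp(-m[pos]))   # sigmoid(-m) for m >= 0
    s[~pos] = 1.0 / (1.0 + np.exp(m[~pos]))              # sigmoid(-m) for m < 0
    return -(Z.T @ s) / n

eta = 20.0          # constant stepsize, far above the stability threshold at w = 0
w = np.zeros(2)
for t in range(2001):
    if t <= 10 or t % 400 == 0:
        print(f"t={t:4d}  loss={logistic_loss(w):.4f}  "
              f"<w, e1>={w[0]:.2f}  <w, e2>={w[1]:.2f}")
    w = w - eta * gradient(w)
```

Swapping the logistic loss for the exponential loss (np.exp(-Z @ w).mean() and its gradient) in the same loop is one way to probe the contrast drawn in the abstract, where large-stepsize GD under the exponential loss may instead blow up.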
Related papers
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
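As a rough, hedged check of the two-phase picture (oscillatory for about $\mathcal{O}(\eta)$ steps, then roughly $\tilde{\mathcal{O}}(1/(\eta t))$ decay), the sketch below reruns constant-stepsize GD on a toy separable logistic-regression problem for several stepsizes and reports the last step at which the loss increased together with the final loss. The dataset, stepsizes, and budget are arbitrary illustrative choices, and the constants in the cited rates are not reproduced here.

```python
import numpy as np

# Toy separable data with labels folded in (z_i = y_i * x_i).
Z = np.array([[0.2, 2.0],
              [0.2, -1.0]])

def loss(w):
    return np.logaddexp(0.0, -Z @ w).mean()

def grad(w):
    m = Z @ w
    # sigmoid(-m) written so that exp() is only applied to -|m| (overflow-free).
    s = np.where(m >= 0,
                 np.exp(-np.abs(m)) / (1.0 + np.exp(-np.abs(m))),
                 1.0 / (1.0 + np.exp(-np.abs(m))))
    return -(Z.T @ s) / Z.shape[0]

T = 4000                       # step budget
for eta in (5.0, 20.0, 80.0):  # increasingly aggressive constant stepsizes
    w = np.zeros(2)
    losses = [loss(w)]
    for _ in range(T):
        w = w - eta * grad(w)
        losses.append(loss(w))
    losses = np.array(losses)
    # Last step at which the loss went up: a crude proxy for the end of the
    # oscillatory phase.
    increases = np.where(np.diff(losses) > 0)[0]
    last_up = int(increases[-1]) + 1 if increases.size else 0
    print(f"eta={eta:5.1f}  last loss increase at step {last_up:5d}  "
          f"final loss={losses[-1]:.3e}")
```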
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
- Max-affine regression via first-order methods [7.12511675782289]
The max-affine model ubiquitously arises in applications in signal processing and statistics.
We present a non-asymptotic convergence analysis of gradient descent (GD) and mini-batch stochastic gradient descent (SGD) for max-affine regression.
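The max-affine model predicts via a maximum of affine functions, $y \approx \max_j (\langle a_j, x\rangle + b_j)$. Below is a minimal, hedged sketch of fitting such a model with full-batch (sub)gradient descent on synthetic data: for each sample, only the affine piece attaining the maximum receives gradient. The data generation, random initialization, stepsize, and iteration count are illustrative choices, not the initialization or conditions analyzed in the paper above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a ground-truth max-affine model with k pieces.
n, d, k = 500, 2, 3
X = rng.normal(size=(n, d))
A_true = rng.normal(size=(k, d))
b_true = rng.normal(size=k)
y = (X @ A_true.T + b_true).max(axis=1) + 0.05 * rng.normal(size=n)

# Randomly initialized parameters to be fitted.
A = rng.normal(size=(k, d))
b = rng.normal(size=k)

lr = 0.2
for step in range(3001):
    P = X @ A.T + b            # all affine pieces, shape (n, k)
    j_star = P.argmax(axis=1)  # active piece for each sample
    y_hat = P[np.arange(n), j_star]
    resid = y_hat - y
    if step % 500 == 0:
        print(f"step {step:4d}  mse={np.mean(resid ** 2):.4f}")
    # Subgradient of the squared loss: each sample updates only its active piece.
    M = np.zeros((n, k))
    M[np.arange(n), j_star] = 1.0
    grad_A = (M * resid[:, None]).T @ X / n
    grad_b = (M * resid[:, None]).sum(axis=0) / n
    A -= lr * grad_A
    b -= lr * grad_b
```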
arXiv Detail & Related papers (2023-08-15T23:46:44Z)
- Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
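One way to see why two-step (rather than one-step) updates are a natural object of analysis beyond the edge of stability is a 1-D toy loss: with a stepsize above 2 divided by the curvature at the minimum, single GD iterates never settle, but they approach a stable period-2 orbit, so the composition of two GD steps does converge. The loss below, $f(x) = \tfrac{1}{2}\log(1 + x^2)$, is my own illustrative choice, not necessarily an example from the paper.

```python
import numpy as np

# f(x) = 0.5 * log(1 + x^2) has curvature 1 at its minimum x = 0,
# so plain GD is unstable at x = 0 whenever eta > 2.
def gd_step(x, eta):
    return x - eta * x / (1.0 + x ** 2)

eta = 3.0
# For this loss and 2 < eta < 4, GD has a stable period-2 orbit at
# x = +/- sqrt(eta / 2 - 1); the two-step map converges to these points.
x_star = np.sqrt(eta / 2.0 - 1.0)

x = 0.7
for t in range(21):
    # The sign of x alternates every step, while |x| approaches x_star.
    print(f"t={t:2d}  x={x:+.6f}  |x| - x*={abs(x) - x_star:+.2e}")
    x = gd_step(x, eta)
```

For this particular toy loss the period-2 orbit itself loses stability once eta exceeds 4, which is one reason a two-step analysis is only a first step beyond EoS.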
arXiv Detail & Related papers (2022-06-08T21:32:50Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent [4.511923587827301]
Gradient descent (GD) can achieve improved generalization when its dynamics exhibits chaotic behavior.
In this study, we incorporate a chaotic component into GD in a controlled manner and introduce multiscale perturbed GD (MPGD).
MPGD is a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system.
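As a purely illustrative sketch of that general recipe (not the paper's MPGD construction, scalings, or analysis), the snippet below augments each GD step on a simple least-squares objective with a small perturbation driven by an independent chaotic system, here a logistic map. The objective, coupling strength, and stepsize are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simple least-squares objective: L(w) = 0.5 * mean((X w - y)^2).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w):
    return X.T @ (X @ w - y) / n

eta = 0.1    # GD stepsize
eps = 0.01   # strength of the chaotic perturbation (illustrative)
w = np.zeros(d)
s = rng.uniform(0.1, 0.9, size=d)   # state of the independent chaotic system

for t in range(2001):
    if t % 400 == 0:
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        print(f"t={t:4d}  loss={loss:.5f}")
    s = 4.0 * s * (1.0 - s)                         # logistic map, chaotic regime
    w = w - eta * grad(w) + eps * (2.0 * s - 1.0)   # GD step + centered chaotic kick
```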
arXiv Detail & Related papers (2022-05-23T14:47:55Z)
- On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
- Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z)
- Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Empirically, inductive biases are central to preventing overfitting.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.