Related papers: Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

URL: http://arxiv.org/abs/2502.13283v1
Date: Tue, 18 Feb 2025 21:04:06 GMT
Title: Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression
Authors: Jingfeng Wu, Peter Bartlett, Matus Telgarsky, Bin Yu
Abstract summary: In logistic regression, gradient descent (GD) diverges in norm while converging in direction to the maximum $\ell_2$-margin solution. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression.
Score: 28.3662709740417
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution -- a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
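
The contrast described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experimental setup: the dimensions, the true parameter `w_star`, the step size rule, and the held-out sample used to estimate the logistic risk are all assumptions chosen for illustration. It runs plain GD from zero on a well-specified logistic model with more features than samples and tracks the iterate norm, the training loss, and a held-out estimate of the logistic risk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test, n_steps = 50, 200, 2000, 2000  # d > n: overparameterized (assumed sizes)

# Hypothetical ground-truth parameter for a well-specified logistic model.
w_star = np.zeros(d)
w_star[0] = 2.0

def sample(m):
    """Draw m Gaussian inputs with labels in {-1, +1} from the logistic model."""
    X = rng.standard_normal((m, d))
    p = 0.5 * (1.0 + np.tanh(0.5 * (X @ w_star)))  # sigmoid(x . w_star), overflow-safe
    y = np.where(rng.random(m) < p, 1.0, -1.0)
    return X, y

def logistic_risk(w, X, y):
    """Mean logistic loss; logaddexp(0, -m) = log(1 + exp(-m)) without overflow."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

X, y = sample(n)
X_te, y_te = sample(n_test)

# The mean logistic loss is L-smooth with L <= ||X||_2^2 / (4 n); use step = 1 / L.
step = 4.0 * n / np.linalg.norm(X, 2) ** 2

w = np.zeros(d)
norms, train, held_out = [], [], []
for t in range(n_steps + 1):
    # Record statistics of iterate t before taking the next step.
    norms.append(np.linalg.norm(w))
    train.append(logistic_risk(w, X, y))
    held_out.append(logistic_risk(w, X_te, y_te))
    margins = y * (X @ w)
    # Gradient of the mean logistic loss: -(1/n) sum_i y_i x_i sigmoid(-margin_i).
    grad = -(X.T @ (y * 0.5 * (1.0 - np.tanh(0.5 * margins)))) / n
    w = w - step * grad

norms, train, held_out = map(np.array, (norms, train, held_out))
best_t = int(np.argmin(held_out))  # oracle early-stopping time on the held-out sample

print(f"iterate norm: t=10 -> {norms[10]:.3f}, t={n_steps} -> {norms[-1]:.3f}")
print(f"training loss at t={n_steps}: {train[-1]:.5f}")
print(f"held-out logistic risk: best (t={best_t}) {held_out[best_t]:.3f}, "
      f"final {held_out[-1]:.3f}")
```

In this sketch the stopping time is picked by looking at the held-out risk, which is an oracle choice made only to show the gap between an early iterate and a late one; it is not the stopping rule analyzed in the paper.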

Related papers

Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization [96.97196425604893]
Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal. This work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem.
arXiv Detail & Related papers (2025-09-21T22:02:38Z)
Nonasymptotic Analysis of Stochastic Gradient Descent with the Richardson-Romberg Extrapolation [22.652143194356864]
We address the problem of solving strongly convex and smooth problems using stochastic gradient descent (SGD) with a constant step size. We provide an expansion of the mean-squared error of the resulting estimator with respect to the number of iterations $n$. Our analysis relies on the properties of the SGD iterates viewed as a time-homogeneous Markov chain.
arXiv Detail & Related papers (2024-10-07T15:02:48Z)
Asymptotics of Stochastic Gradient Descent with Dropout Regularization in Linear Models [8.555650549124818]
This paper proposes a theory for online inference of the stochastic gradient descent (SGD) iterates with dropout regularization in linear regression. For sufficiently large samples, the proposed confidence intervals for averaged SGD (ASGD) with dropout nearly achieve the nominal coverage probability.
arXiv Detail & Related papers (2024-09-11T17:28:38Z)
Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability [69.01076284478151]
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS). This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent [50.4531316289086]
Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. An overarching theme of the paper is providing general conditions under which SGD converges, assuming that gradient flow (GF) on the population loss converges. We provide a unified analysis for GD/SGD not only for classical settings like convex losses, but also for more complex problems including phase retrieval and matrix square root.
arXiv Detail & Related papers (2022-10-13T03:55:04Z)
High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD). We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD). We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD settings. We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z)
Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically. This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression. We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms. We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable averaging of past gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
