Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
- URL: http://arxiv.org/abs/2402.15926v2
- Date: Mon, 10 Jun 2024 03:32:05 GMT
- Title: Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency
- Authors: Jingfeng Wu, Peter L. Bartlett, Matus Telgarsky, Bin Yu
- Abstract summary: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the stepsize $\eta$ is so large that the loss initially oscillates.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the constant stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1 / (\eta t) )$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize $\eta:= \Theta( T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{\mathcal{O}}(1/T^2)$ acceleration), nonlinear predictors in the neural tangent kernel regime, and online stochastic gradient descent (SGD) with a large stepsize, under suitable separability conditions.
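The result suggests a remarkably simple procedure. Below is a minimal, self-contained sketch (illustrative assumptions, not the authors' code) of the setting in the abstract: full-batch GD with a constant stepsize on the logistic loss over synthetic linearly separable data, using the aggressive choice $\eta = \Theta(T)$. The data-generating process, margin, and constants are assumptions made for illustration.

```python
import numpy as np

# Minimal sketch of the abstract's setting (illustrative, not the authors'
# code): full-batch GD with a constant, aggressive stepsize eta = Theta(T)
# on the logistic loss over linearly separable synthetic data.

def logistic_loss(w, X, y):
    # mean_i log(1 + exp(-y_i <x_i, w>)), computed stably via logaddexp
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def logistic_grad(w, X, y):
    m = y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, m))  # sigmoid(-m), overflow-safe
    return -(X * (y * s)[:, None]).mean(axis=0)

rng = np.random.default_rng(0)
n, d, T = 200, 5, 2000
w_star = np.eye(d)[0]                   # assumed ground-truth direction
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)
X += 0.5 * y[:, None] * w_star          # enforce a margin -> separable data

eta = 0.1 * T                           # eta = Theta(T); the constant is arbitrary
w = np.zeros(d)
losses = [logistic_loss(w, X, y)]
for _ in range(T):
    w -= eta * logistic_grad(w, X, y)
    losses.append(logistic_loss(w, X, y))

# Per the paper, the loss is typically non-monotone for roughly O(eta) steps
# and only then decays, so the monotonicity check below should print False.
print(f"loss: start {losses[0]:.3f}, end {losses[-1]:.3e}")
print("monotone?", all(a >= b for a, b in zip(losses, losses[1:])))
```

For contrast, a small stable stepsize keeps the loss monotone but, per the paper's framing, forgoes the accelerated $\tilde{\mathcal{O}}(1/T^2)$ rate.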
Related papers
- Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability
In machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS).
This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime.
arXiv Detail & Related papers (2023-05-19T16:24:47Z)
- High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise in the stochastic gradients.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme; a generic sketch of such a schedule appears after this list.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - On the Convergence of Step Decay Step-Size for Stochastic Optimization [27.02857082612736]
The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training.
We provide convergence results for step decay in the non-smooth regime, ensuring that the gradient norm vanishes.
In the strongly convex case, we establish an $\mathcal{O}(\ln T / T)$ rate, which we also prove to be near-optimal; a generic sketch of a step decay schedule appears after this list.
arXiv Detail & Related papers (2021-02-18T14:37:25Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits the inputs and outputs perfectly, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Gradient Methods Never Overfit On Separable Data [31.714928102950584]
We show that standard gradient methods never overfit on separable data.
We present non-asymptotic bounds on the number of margin violations over the dataset.
arXiv Detail & Related papers (2020-06-30T18:01:46Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for the considered step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)