Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension
- URL: http://arxiv.org/abs/2602.12471v1
- Date: Thu, 12 Feb 2026 22:58:18 GMT
- Title: Tight Bounds for Logistic Regression with Large Stepsize Gradient Descent in Low Dimension
- Authors: Michael Crawshaw, Mingrui Liu
- Abstract summary: We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. We show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size and $\gamma$ is the dataset's margin.
- Score: 36.3266119975955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the optimization problem of minimizing the logistic loss with gradient descent to train a linear model for binary classification with separable data. With a budget of $T$ iterations, it was recently shown that an accelerated $1/T^2$ rate is possible by choosing a large step size $\eta = \Theta(\gamma^2 T)$ (where $\gamma$ is the dataset's margin) despite the resulting non-monotonicity of the loss. In this paper, we provide a tighter analysis of gradient descent for this problem when the data is two-dimensional: we show that GD with a sufficiently large learning rate $\eta$ finds a point with loss smaller than $\mathcal{O}(1/(\eta T))$, as long as $T \geq \Omega(n/\gamma + 1/\gamma^2)$, where $n$ is the dataset size. Our improved rate comes from a tighter bound on the time $\tau$ that it takes for GD to transition from unstable (non-monotonic loss) to stable (monotonic loss), via a fine-grained analysis of the oscillatory dynamics of GD in the subspace orthogonal to the max-margin classifier. We also provide a lower bound on $\tau$ matching our upper bound up to logarithmic factors, showing that our analysis is tight.
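As an illustrative sketch only (not the paper's analysis), the setting above can be reproduced with full-batch gradient descent on the logistic loss over separable two-dimensional data: with a large stepsize the loss is typically non-monotonic early on, then settles into a stable decreasing phase. The dataset construction, stepsize, and iteration count below are hypothetical choices for illustration.

```python
import math
import random

def logistic_loss(w, data):
    # Mean logistic loss (1/n) * sum log(1 + exp(-y <w, x>)), computed stably.
    total = 0.0
    for x, y in data:
        m = y * (w[0] * x[0] + w[1] * x[1])
        if m > 0:
            total += math.log1p(math.exp(-m))
        else:
            total += -m + math.log1p(math.exp(m))
    return total / len(data)

def gd_logistic(data, eta, T):
    # Full-batch gradient descent on the logistic loss with constant stepsize eta.
    n = len(data)
    w = [0.0, 0.0]
    losses = []
    for _ in range(T):
        g = [0.0, 0.0]
        for x, y in data:
            m = y * (w[0] * x[0] + w[1] * x[1])
            s = 0.5 * (1.0 - math.tanh(0.5 * m))  # sigmoid(-m), overflow-safe
            g[0] -= s * y * x[0] / n
            g[1] -= s * y * x[1] / n
        w[0] -= eta * g[0]
        w[1] -= eta * g[1]
        losses.append(logistic_loss(w, data))
    return w, losses

# Linearly separable 2D toy data with an explicit margin along the first coordinate.
random.seed(0)
data = []
for _ in range(20):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = 1.0 if x[0] >= 0 else -1.0
    x[0] += 0.5 * y
    data.append((x, y))

w, losses = gd_logistic(data, eta=5.0, T=300)
# The loss sequence may rise on some early steps (instability) before
# decreasing monotonically once GD stabilizes.
```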
Related papers
- Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes [32.64968378005787]
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk. We show that GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$. Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, where $\gamma$ is the margin of the dataset, matching GD even in constants.
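For context on the $1/\gamma^2$ step complexity mentioned above, here is a minimal sketch of the classical Perceptron: by Novikoff's theorem, on data separable with margin $\gamma$ and radius $R = \max \|x\|$, it makes at most $(R/\gamma)^2$ mistakes. The toy dataset below is a hypothetical example, not from either paper.

```python
import random

def perceptron(data, max_epochs=100):
    # Classical Perceptron (Novikoff, 1962): on each mistake, add y*x to w.
    # Mistake bound on separable data: at most (R / gamma)^2 updates total.
    d = len(data[0][0])
    w = [0.0] * d
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                for i in range(d):
                    w[i] += y * x[i]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break  # all points correctly classified
    return w, mistakes

# Separable 2D toy data with margin at least 1 along the first coordinate.
random.seed(1)
data = []
for _ in range(30):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    y = 1.0 if x[0] >= 0 else -1.0
    x[0] += 1.0 * y
    data.append((x, y))

w, mistakes = perceptron(data)
```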
arXiv Detail & Related papers (2025-04-05T08:34:20Z) - Convergence Rate Analysis of LION [54.28350823319057]
LION converges in $\mathcal{O}(\sqrt{d}\,K^{-1/4})$ iterations measured by the gradient norm of Karush-Kuhn-Tucker (KKT) points.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z) - Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic Regression [74.28017932704704]
We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression.
We also give a tradeoff that yields a $1+\varepsilon$ approximation in input sparsity time.
Our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.
arXiv Detail & Related papers (2023-03-31T18:12:33Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex Optimization [9.019243171993553]
Training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Gradient-Based Empirical Risk Minimization using Local Polynomial Regression [39.29885444997579]
A major goal of this literature has been to compare different algorithms, such as gradient descent (GD) or stochastic gradient descent (SGD).
We demonstrate that when the loss function is smooth in the data, we can learn the oracle at every iteration and beat the oracle complexities of both GD and SGD.
arXiv Detail & Related papers (2020-11-04T20:10:31Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{tn}\big)$ rate, which is at the order of the excess error of a learning model. For generic loss functions, we prove that HSDMPG attains the same $\mathcal{O}\big(1/\sqrt{tn}\big)$ order of excess error.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Online Robust Regression via SGD on the l1 loss [19.087335681007477]
We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner.
We show in this work that SGD on the $\ell_1$ loss converges to the true parameter vector at a $\tilde{O}\big(1/((1-\eta)^2 n)\big)$ rate which is independent of the values of the contaminated measurements.
arXiv Detail & Related papers (2020-07-01T11:38:21Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges with the same $O(d\ln(N)/\sqrt{N})$ rate when used with the default parameters.
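For reference, here is a minimal sketch of the Adam update with its standard default parameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); this illustrates the algorithm the convergence result refers to, not the proof itself. The toy quadratic objective and learning rate below are hypothetical choices for illustration.

```python
import math

def adam_step(w, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update (Kingma & Ba) on a scalar parameter with default hyperparameters.
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad            # first-moment EMA
    v = b2 * v + (1 - b2) * grad * grad     # second-moment EMA
    m_hat = m / (1 - b1 ** t)               # bias corrections
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

# Minimize the toy objective f(w) = w^2, gradient 2w, starting from w = 1.0.
w, state = 1.0, (0.0, 0.0, 0)
for _ in range(1000):
    w, state = adam_step(w, 2.0 * w, state, lr=0.01)
```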
arXiv Detail & Related papers (2020-03-05T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.