Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
- URL: http://arxiv.org/abs/2406.05033v2
- Date: Mon, 04 Nov 2024 15:23:23 GMT
- Title: Gradient Descent on Logistic Regression with Non-Separable Data and Large Step Sizes
- Authors: Si Yi Meng, Antonio Orvieto, Daniel Yiming Cao, Christopher De Sa
- Abstract summary: We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes.
Local convergence is guaranteed for all step sizes less than the critical step size, but global convergence is not.
- Score: 38.595892152591595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study gradient descent (GD) dynamics on logistic regression problems with large, constant step sizes. For linearly-separable data, it is known that GD converges to the minimizer with arbitrarily large step sizes, a property which no longer holds when the problem is not separable. In fact, the behaviour can be much more complex -- a sequence of period-doubling bifurcations begins at the critical step size $2/\lambda$, where $\lambda$ is the largest eigenvalue of the Hessian at the solution. Using a smaller-than-critical step size guarantees convergence if initialized near the solution, but does this suffice globally? In one dimension, we show that a step size less than $1/\lambda$ suffices for global convergence. However, for all step sizes between $1/\lambda$ and the critical step size $2/\lambda$, one can construct a dataset such that GD converges to a stable cycle. In higher dimensions, this is actually possible even for step sizes less than $1/\lambda$. Our results show that although local convergence is guaranteed for all step sizes less than the critical step size, global convergence is not, and GD may instead converge to a cycle depending on the initialization.
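To make the bifurcation concrete, here is a minimal numerical sketch (the dataset, initialization, and step-size choices below are illustrative, not taken from the paper). It fits a 1-D non-separable logistic regression, estimates $\lambda$ at the solution, and runs GD with step sizes on either side of the critical value $2/\lambda$:

```python
import numpy as np

# Illustrative non-separable 1-D dataset (not from the paper): conflicting
# labels at x = 2 force a finite minimizer of the logistic loss.
x = np.array([2.0, 2.0, -1.0])
y = np.array([1.0, -1.0, 1.0])

def grad(w):
    s = 1.0 / (1.0 + np.exp(y * w * x))      # sigmoid(-y * w * x)
    return np.mean(-y * x * s)

def hess(w):
    s = 1.0 / (1.0 + np.exp(-w * x))         # sigmoid(w * x)
    return np.mean(x**2 * s * (1.0 - s))

# Newton's method on this 1-D convex problem locates the minimizer.
w_star = 0.0
for _ in range(50):
    w_star -= grad(w_star) / hess(w_star)
lam = hess(w_star)    # in 1-D, the Hessian's only (hence largest) eigenvalue

for ratio in (0.9, 1.9, 2.1):                # step size as a multiple of 1/lam
    eta, w = ratio / lam, w_star + 0.01      # initialize near the solution
    for _ in range(5000):
        w -= eta * grad(w)
    tail = []
    for _ in range(4):                       # record a few final iterates
        w -= eta * grad(w)
        tail.append(w)
    print(f"eta = {ratio:.1f}/lambda -> tail iterates {np.round(tail, 4)}")
# Below 2/lambda the tail collapses to w_star; just above it, the iterates
# typically alternate between two values: a period-2 cycle.
```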
Related papers
- Gradient Descent on Logistic Regression: Do Large Step-Sizes Work with Data on the Sphere? [40.358326886432664]
This paper explores whether restricting the data to have equal magnitude is a sufficient condition for global convergence. We prove this is true in a one-dimensional space, but in higher dimensions cycling behaviour can still occur. We hope to inspire further studies on quantifying how common these cycles are in realistic datasets.
arXiv Detail & Related papers (2025-07-15T11:58:42Z) - Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression [9.15420898792103]
We study gradient descent (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. We show that convergence can be accelerated to $\widetilde{\mathcal{O}}(\sqrt{\kappa})$ steps by simply using a large stepsize.
arXiv Detail & Related papers (2025-06-03T00:21:22Z) - Minimax Optimal Convergence of Gradient Descent in Logistic Regression via Large and Adaptive Stepsizes [32.64968378005787]
We study $\textit{gradient descent}$ (GD) for logistic regression on linearly separable data with stepsizes that adapt to the current risk.
We show that GD achieves a risk upper bounded by $\exp(-\Theta(\eta))$, where $\eta$ is the constant stepsize hyperparameter and $\gamma$ is the margin of the dataset.
Notably, the classical $\textit{Perceptron}$ (Novikoff, 1962), a first-order online method, also achieves a step complexity of $1/\gamma^2$, matching GD even in constants.
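For reference, the classical Perceptron is easy to state in full; the sketch below is a standard textbook implementation on a made-up separable dataset, not code from the paper:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Mistake-driven Perceptron; Novikoff (1962) bounds the total number
    of mistakes by (R / gamma)^2 on margin-gamma data of radius R."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        clean_pass = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake (or exactly on the boundary)
                w += yi * xi         # rotate the hyperplane toward the example
                mistakes += 1
                clean_pass = False
        if clean_pass:               # a full pass with no mistakes: separated
            break
    return w, mistakes

# Illustrative toy data, linearly separable by construction (y * x0 >= 1).
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=50)
X = np.column_stack([y * (1.0 + rng.random(50)), rng.normal(size=50)])
w, m = perceptron(X, y)
print(f"mistakes: {m}, training error: {np.mean(np.sign(X @ w) != y):.2f}")
```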
arXiv Detail & Related papers (2025-04-05T08:34:20Z) - Two-Timescale Linear Stochastic Approximation: Constant Stepsizes Go a Long Way [12.331596909999764]
We investigate constant stepsize schemes through the lens of Markov processes.
We derive explicit geometric and non-asymptotic convergence rates, as well as the variance and bias introduced by constant stepsizes.
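As a hedged illustration of this constant-stepsize picture (the linear two-timescale system, noise, and stepsizes below are invented for the demo, not taken from the paper), the iterates converge geometrically to a stationary distribution whose spread shrinks with the stepsize:

```python
import numpy as np

rng = np.random.default_rng(1)

def run(alpha, beta, steps=100_000, tail=20_000):
    """Two-timescale linear SA: the fast variable y tracks x, while the
    slow variable x drifts toward the fixed point x* = 1 using y."""
    x = y = 0.0
    last = []
    for t in range(steps):
        y += alpha * (x - y + rng.normal(scale=0.5))   # fast timescale
        x += beta * (1.0 - y + rng.normal(scale=0.5))  # slow timescale
        if t >= steps - tail:
            last.append(x)
    return np.mean(last), np.std(last)

for alpha, beta in [(0.05, 0.005), (0.005, 0.0005)]:
    m, s = run(alpha, beta)
    print(f"alpha={alpha}, beta={beta}: mean(x)={m:.4f}, std(x)={s:.4f}")
# Constant stepsizes settle into a stationary distribution around x* = 1;
# smaller stepsizes shrink its spread (and any constant-stepsize bias).
```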
arXiv Detail & Related papers (2024-10-16T21:49:27Z) - Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits its initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\widetilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\widetilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
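The following sketch illustrates the non-monotone phase on a toy problem (the separable-by-construction dataset and the stepsize $\eta = 50$ are illustrative assumptions, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
y = rng.choice([-1.0, 1.0], size=n)
# Separable by construction: a small guaranteed margin in coordinate 0,
# plus a larger noisy coordinate that big steps overshoot along.
X = np.column_stack([0.2 * y, rng.normal(size=n)])

def loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))   # stable log(1 + e^{-margin})

def sigmoid_neg(m):
    out = np.empty_like(m)                            # sigmoid(-m), stably
    pos = m > 0
    out[pos] = np.exp(-m[pos]) / (1.0 + np.exp(-m[pos]))
    out[~pos] = 1.0 / (1.0 + np.exp(m[~pos]))
    return out

def grad(w):
    s = sigmoid_neg(y * (X @ w))
    return -(X * (y * s)[:, None]).mean(axis=0)

w = np.zeros(2)
eta = 50.0                                  # deliberately large constant stepsize
hist = [loss(w)]
for _ in range(500):
    w -= eta * grad(w)
    hist.append(loss(w))
print("first 8 losses:", np.round(hist[:8], 3))   # typically non-monotone early on
print("final loss:", f"{hist[-1]:.3e}")           # eventually driven toward zero
```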
arXiv Detail & Related papers (2024-02-24T23:10:28Z) - Proximal Gradient Descent-Ascent: Variable Convergence under K{\L}
Geometry [49.65455534654459]
Gradient descent-ascent (GDA) has been widely applied to solve minimax optimization problems.
This paper fills such a gap by studying the convergence of proximal gradient descent-ascent under the Kurdyka-{\L}ojasiewicz (K{\L}) geometry.
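For concreteness, here is a minimal sketch of a proximal gradient descent-ascent iteration on a toy minimax problem (the objective, stepsizes, and $\ell_1$ prox below are illustrative choices, not the paper's setting):

```python
import numpy as np

# Toy minimax problem: min_x max_y  x*y - y**2/2 + |x|
# (strongly concave in y; the nonsmooth |x| is handled by its prox).
def soft_threshold(z, t):
    """Prox of t * |.|: the soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

eta, beta = 0.1, 0.1          # primal (descent) and dual (ascent) stepsizes
x, y = 1.0, 0.0
for _ in range(500):
    x = soft_threshold(x - eta * y, eta)   # proximal gradient descent step in x
    y = y + beta * (x - y)                 # gradient ascent step in y
print(f"converged to (x, y) = ({x:.4f}, {y:.4f})")  # saddle point at (0, 0)
```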
arXiv Detail & Related papers (2021-02-09T05:35:53Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor perfectly fits the inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
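A quick sketch of this interpolation regime (finite dimensions and an identity feature map stand in for the possibly infinite-dimensional $\phi$; all sizes and the stepsize are demo choices): since the optimum fits every sample exactly, the gradient noise vanishes at $\theta_*$ and constant-stepsize SGD drives the last iterate to the optimum:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 2000, 20
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(n, d))       # features phi(X_i); identity feature map here
Y = Phi @ theta_star                # noiseless: <theta_*, phi(X_i)> = Y_i exactly

theta = np.zeros(d)
eta = 0.02                          # constant stepsize, small vs. per-sample smoothness
for phi_i, y_i in zip(Phi, Y):      # a single pass over the data
    residual = phi_i @ theta - y_i  # zero at theta_star, so the noise vanishes there
    theta -= eta * residual * phi_i
print("last-iterate error:", np.linalg.norm(theta - theta_star))
```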
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{tn}\big)$ optimization error, which is at the order of the excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a wide spectrum of step-size policies.
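A toy illustration of the claim (the objective, noise scale, and stepsize are made-up demo choices, not from the paper): started exactly at a strict saddle, gradient noise pushes SGD off the unstable manifold toward a minimizer:

```python
import numpy as np

# f(x, y) = (x**2 - 1)**2 / 4 + y**2 / 2 has a strict saddle at (0, 0)
# (Hessian eigenvalues -1 and +1) and minima at (+-1, 0).
def grad(p):
    x, y = p
    return np.array([x**3 - x, y])

rng = np.random.default_rng(4)
p = np.zeros(2)                     # start exactly at the strict saddle
eta = 0.05
for _ in range(2000):
    noise = rng.normal(scale=0.01, size=2)  # stand-in for minibatch gradient noise
    p -= eta * (grad(p) + noise)
print("SGD left the saddle and settled near:", np.round(p, 3))  # ~(+-1, 0)
```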
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - A Simple Convergence Proof of Adam and Adagrad [74.24716715922759]
We give a simple proof of convergence for the Adam and Adagrad algorithms, with a rate of $O(d\ln(N)/\sqrt{N})$.
Adam converges at the same $O(d\ln(N)/\sqrt{N})$ rate when used with the default parameters.
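For reference, a minimal sketch of the Adam update with the default parameters the summary refers to ($\mathrm{lr}=10^{-3}$, $\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$); the quadratic objective is an illustrative stand-in:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015) with the default parameters."""
    m = b1 * m + (1 - b1) * g             # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g**2          # second-moment (scale) estimate
    m_hat = m / (1 - b1**t)               # bias corrections for the zero init
    v_hat = v / (1 - b2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Illustrative use: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta = np.array([5.0, -3.0])
m = v = np.zeros_like(theta)
for t in range(1, 20001):
    theta, m, v = adam_step(theta, theta, m, v, t)
print("theta after Adam:", np.round(theta, 4))    # ends up near the minimizer 0
```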
arXiv Detail & Related papers (2020-03-05T01:56:17Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms, policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)