Gradient Methods Never Overfit On Separable Data
- URL: http://arxiv.org/abs/2007.00028v2
- Date: Thu, 10 Sep 2020 10:51:50 GMT
- Title: Gradient Methods Never Overfit On Separable Data
- Authors: Ohad Shamir
- Abstract summary: We show that standard gradient methods never overfit on separable data.
We present non-asymptotic bounds on the number of margin violations over the dataset.
- Score: 31.714928102950584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A line of recent works established that when training linear predictors over
separable data, using gradient methods and exponentially-tailed losses, the
predictors asymptotically converge in direction to the max-margin predictor. As
a consequence, the predictors asymptotically do not overfit. However, this does
not address the question of whether overfitting might occur non-asymptotically,
after some bounded number of iterations. In this paper, we formally show that
standard gradient methods (in particular, gradient flow, gradient descent and
stochastic gradient descent) never overfit on separable data: If we run these
methods for $T$ iterations on a dataset of size $m$, both the empirical risk
and the generalization error decrease at an essentially optimal rate of
$\tilde{\mathcal{O}}(1/\gamma^2 T)$ up till $T\approx m$, at which point the
generalization error remains fixed at an essentially optimal level of
$\tilde{\mathcal{O}}(1/\gamma^2 m)$ regardless of how large $T$ is. Along the
way, we present non-asymptotic bounds on the number of margin violations over
the dataset, and prove their tightness.
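For intuition, here is a minimal simulation sketch (an illustration under assumed settings, not code from the paper): full-batch gradient descent with the logistic loss on synthetic linearly separable data with margin at least $\gamma$, reporting training and held-out 0-1 errors at a few iteration counts. Under the result above, both errors should decay roughly like $\tilde{\mathcal{O}}(1/\gamma^2 T)$ until $T \approx m$, after which the held-out error plateaus near $\tilde{\mathcal{O}}(1/\gamma^2 m)$; the data model, stepsize, and constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data: labels come from a ground-truth unit
# vector w_star, and rejection sampling enforces a margin of at least gamma.
d, m, m_test, gamma = 20, 200, 2000, 0.1
w_star = np.zeros(d)
w_star[0] = 1.0

def sample(n):
    X, y = [], []
    while len(X) < n:
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        margin = x @ w_star
        if abs(margin) >= gamma:      # keep only points with margin >= gamma
            X.append(x)
            y.append(np.sign(margin))
    return np.array(X), np.array(y)

X, y = sample(m)
X_test, y_test = sample(m_test)

# Full-batch gradient descent on the unregularized logistic loss.
w = np.zeros(d)
eta = 1.0
for t in range(1, 5001):
    z = y * (X @ w)
    grad = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)
    w -= eta * grad
    if t in (10, 100, 1000, 5000):
        train_err = np.mean(y * (X @ w) <= 0)              # training 0-1 error
        test_err = np.mean(y_test * (X_test @ w) <= 0)     # held-out 0-1 error
        print(f"T={t:5d}  train error={train_err:.3f}  test error={test_err:.3f}")
```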
Related papers
- Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency [47.8739414267201]
We consider gradient descent (GD) with a constant stepsize applied to logistic regression with linearly separable data.
We show that GD exits this initial oscillatory phase rapidly -- in $\mathcal{O}(\eta)$ steps -- and subsequently achieves an $\tilde{\mathcal{O}}(1/(\eta t))$ convergence rate.
Our results imply that, given a budget of $T$ steps, GD can achieve an accelerated loss of $\tilde{\mathcal{O}}(1/T^2)$ with an aggressive stepsize.
arXiv Detail & Related papers (2024-02-24T23:10:28Z)
- On the Convergence of Gradient Descent for Large Learning Rates [55.33626480243135]
We show that convergence is impossible when a fixed step size is used that is too large.
We provide a proof of this in the case of linear neural networks with a squared loss.
We also prove the impossibility of convergence for more general losses without requiring strong assumptions such as Lipschitz continuity for the gradient.
arXiv Detail & Related papers (2024-02-20T16:01:42Z)
- Tight Risk Bounds for Gradient Descent on Separable Data [33.593203156666746]
We study the generalization properties of unregularized gradient methods applied to separable linear classification.
Our risk lower bounds are the first in this context and establish the tightness of our upper bounds for any given tail decay rate.
arXiv Detail & Related papers (2023-03-02T10:31:58Z)
- Precise Learning Curves and Higher-Order Scaling Limits for Dot Product Kernel Regression [41.48538038768993]
We focus on the problem of kernel ridge regression for dot-product kernels.
We observe a peak in the learning curve whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and nontrivial behavior at multiple scales.
arXiv Detail & Related papers (2022-05-30T04:21:31Z)
- Convergence of First-Order Methods for Constrained Nonconvex Optimization with Dependent Data [7.513100214864646]
We show a worst-case convergence rate of $\tilde{O}(t^{-1/4})$ and complexity $\tilde{O}(\varepsilon^{-4})$, measured via the gradient of the Moreau envelope, for smooth nonconvex optimization.
We also obtain the first online nonnegative matrix factorization algorithms for dependent data, based on projected gradient methods with adaptive step sizes and optimal rates of convergence.
arXiv Detail & Related papers (2022-03-29T17:59:10Z)
- High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points with high probability, with the best-known iteration complexity for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z)
- Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $\tilde{O}(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity.
arXiv Detail & Related papers (2020-10-01T16:48:33Z)
- ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
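The last entry above builds on Carathéodory-style measure reduction: any $k > d+1$ weighted points in $\mathbb{R}^d$ are affinely dependent, so weight can be shifted along a null-space direction until one weight vanishes, without changing the weighted mean. The sketch below is a minimal, self-contained illustration of that building block (a plain recombination routine, not the paper's adaptive sampling scheme); the function name and the plain-NumPy implementation are assumptions for illustration.

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Reduce a discrete measure on `points` (n x d) to at most d + 1 support
    points while preserving the total weight and the weighted mean.

    Classical recombination step behind Caratheodory-type measure reduction;
    a simplified illustration, not the paper's adaptive scheme.
    """
    points = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float).copy()
    d = points.shape[1]

    support = np.flatnonzero(w > tol)
    while support.size > d + 1:
        # The columns [x_i; 1] over the current support are linearly dependent,
        # so there is a c != 0 with sum_i c_i x_i = 0 and sum_i c_i = 0.
        A = np.vstack([points[support].T, np.ones(support.size)])   # (d+1) x k
        c = np.linalg.svd(A)[2][-1]          # a null-space direction (k > d+1)
        if not np.any(c > 0):
            c = -c                           # ensure some entry is positive
        # Shift weights along c until one hits zero; the weighted mean and the
        # total weight are unchanged because A @ c = 0.
        pos = c > 0
        t = np.min(w[support][pos] / c[pos])
        w[support] -= t * c
        w[w < tol] = 0.0
        support = np.flatnonzero(w > tol)
    return support, w[support]


# Example: 200 random "gradients" in R^5 with uniform weights.
rng = np.random.default_rng(0)
G = rng.normal(size=(200, 5))
w0 = np.full(200, 1.0 / 200)
idx, w_red = caratheodory_reduce(G, w0)
print(len(idx))                              # at most d + 1 = 6 support points
print(np.allclose(w0 @ G, w_red @ G[idx]))   # weighted mean preserved -> True
```

In an SGD context, such a reduction lets a large weighted batch be replaced by a small weighted subset with the same aggregate gradient, which is the cost-saving idea the entry above refers to.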
This list is automatically generated from the titles and abstracts of the papers on this site.