Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization
- URL: http://arxiv.org/abs/2303.10758v2
- Date: Wed, 10 May 2023 15:16:59 GMT
- Title: Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization
- Authors: Peiyuan Zhang, Jiaye Teng, Jingzhao Zhang
- Abstract summary: Training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show that better rates can be attained in the realizable setting, but the improvement is reduced when training time is long.
- Score: 9.019243171993553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies the generalization error of gradient methods. More
specifically, we focus on how training steps $T$ and step-size $\eta$ might
affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and
Stochastic Gradient Descent (SGD) under the general non-realizable smooth SCO
setting, suggesting that existing stability analyses are tight in step-size and
iteration dependence, and that overfitting provably happens. Next, we study the
case when the loss is realizable, i.e., an optimal solution minimizes the loss at every
data point. Recent works show that better rates can be attained, but the improvement
is reduced when training time is long. Our paper examines this observation by
providing excess risk lower bounds for GD and SGD in two realizable settings:
(1) $\eta T = \mathcal{O}(n)$, and (2) $\eta T = \Omega(n)$, where $n$ is the size
of the dataset. In the first case, $\eta T = \mathcal{O}(n)$, our lower bounds tightly
match and certify the respective upper bounds. However, for the case $\eta T =
\Omega(n)$, our analysis indicates a gap between the lower and upper bounds.
A conjecture is proposed that the gap can be closed by improving upper bounds,
supported by analyses in two special scenarios.
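To make the interplay of step-size $\eta$, training steps $T$, and sample size $n$ concrete, here is a minimal numerical sketch (not the paper's construction): it runs full-batch GD and single-sample SGD on a smooth convex logistic-regression objective and reports the gap between training loss and held-out loss as the number of steps grows past the dataset size. The dimensions, step size, and data distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(w, X, y):
    # Smooth convex loss: mean of log(1 + exp(-y * <x, w>)).
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-m)) w.r.t. m, times y
    return (X * coef[:, None]).mean(axis=0)

# Synthetic data: n training points, a large held-out set as a proxy for population risk.
n, d = 50, 20
w_star = rng.normal(size=d)

def sample(m):
    X = rng.normal(size=(m, d))
    y = np.sign(X @ w_star + 0.5 * rng.normal(size=m))
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(5000)

def generalization_gap(eta, T, stochastic):
    w = np.zeros(d)
    for _ in range(T):
        if stochastic:   # single-sample SGD step
            i = rng.integers(n)
            g = logistic_grad(w, X_tr[i:i + 1], y_tr[i:i + 1])
        else:            # full-batch GD step
            g = logistic_grad(w, X_tr, y_tr)
        w -= eta * g
    return logistic_loss(w, X_te, y_te) - logistic_loss(w, X_tr, y_tr)

for T in (n, 10 * n, 100 * n):   # eta * T grows from roughly O(n) to much larger
    print(f"T={T:5d}  GD gap={generalization_gap(0.2, T, False):.3f}"
          f"  SGD gap={generalization_gap(0.2, T, True):.3f}")
```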
Related papers
- Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with the environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic methods for optimization problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
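For intuition about clipping under heavy-tailed noise, the following small sketch (not taken from the paper) compares the raw and clipped norms of Student-t distributed gradient noise; the threshold and the degrees of freedom are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

def clip(g, lam):
    # Rescale g so that its Euclidean norm never exceeds lam.
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

# Student-t noise with 1.5 degrees of freedom has finite moments only of
# order alpha < 1.5, so its variance is infinite (a heavy-tailed regime).
noise = rng.standard_t(df=1.5, size=(10_000, 5))
raw_norms = np.linalg.norm(noise, axis=1)
clipped_norms = np.array([np.linalg.norm(clip(g, 10.0)) for g in noise])
print("max raw norm:    ", raw_norms.max())
print("max clipped norm:", clipped_norms.max())
```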
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
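As a rough illustration of the two sampling schemes contrasted above, the sketch below runs epochs of with-replacement SGD and of shuffled SGD (random reshuffling) on a least-squares objective; the data, step size, and epoch count are placeholder choices rather than the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 10
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    # Gradient of the i-th term 0.5 * (a_i^T w - b_i)^2.
    return A[i] * (A[i] @ w - b[i])

def epoch_with_replacement(w, eta):
    for _ in range(n):
        i = rng.integers(n)            # index drawn independently at every step
        w = w - eta * grad_i(w, i)
    return w

def epoch_shuffled(w, eta):
    for i in rng.permutation(n):       # each index visited exactly once per epoch
        w = w - eta * grad_i(w, i)
    return w

w_iid, w_shuf, eta = np.zeros(d), np.zeros(d), 0.01
for _ in range(20):                    # 20 epochs of each variant
    w_iid = epoch_with_replacement(w_iid, eta)
    w_shuf = epoch_shuffled(w_shuf, eta)
print("train MSE (with replacement):", np.mean((A @ w_iid - b) ** 2))
print("train MSE (shuffled):        ", np.mean((A @ w_shuf - b) ** 2))
```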
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to the noise in the stochastic gradients.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
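The exponential step-size idea mentioned above can be sketched as follows; the quadratic objective, noise model, and decay constant are illustrative assumptions, not the schedule or setting analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 10, 2000
w_opt = rng.normal(size=d)

def stochastic_grad(w):
    # Noisy gradient of the smooth convex objective 0.5 * ||w - w_opt||^2.
    return (w - w_opt) + 0.5 * rng.normal(size=d)

# Exponentially decaying step sizes: eta_t = eta_0 * alpha**t, with alpha chosen
# so the step size shrinks by a factor of ~100 over the horizon T (arbitrary here).
eta0 = 0.5
alpha = (1.0 / 100.0) ** (1.0 / T)

w = np.zeros(d)
for t in range(T):
    w -= eta0 * alpha**t * stochastic_grad(w)
print("distance to optimum:", np.linalg.norm(w - w_opt))
```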
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - The Convergence Rate of SGD's Final Iterate: Analysis on Dimension
Dependence [2.512827436728378]
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization.
We show how to characterize the final-iterate convergence of SGD in the constant dimension setting.
arXiv Detail & Related papers (2021-06-28T11:51:04Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high probability with the best-known iteration complexity for smooth losses.
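A minimal sketch of an update of the kind described above, combining gradient clipping, momentum, and a normalized descent step; the constants and the exact way the three pieces are combined here are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(4)

def noisy_grad(w):
    # Gradient of f(w) = sum_i (w_i*sin(w_i) + cos(w_i)), a smooth nonconvex test
    # function, plus heavy-tailed Student-t noise (finite mean, infinite variance).
    return w * np.cos(w) + rng.standard_t(df=2.0, size=w.shape)

def clip(g, lam):
    norm = np.linalg.norm(g)
    return g if norm <= lam else (lam / norm) * g

w = rng.normal(size=10)
m = np.zeros_like(w)
eta, beta, lam = 0.01, 0.9, 5.0
for _ in range(3000):
    g = clip(noisy_grad(w), lam)                 # clip the stochastic gradient
    m = beta * m + (1.0 - beta) * g              # exponential-average momentum
    w -= eta * m / (np.linalg.norm(m) + 1e-12)   # normalized descent step
print("final gradient-oracle norm:", np.linalg.norm(w * np.cos(w)))
```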
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimal predictor perfectly fits the inputs and outputs, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
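As a toy version of the interpolation setting above, the sketch below runs constant-step-size SGD on a noiseless least-squares problem where a linear predictor fits every sample exactly and reports the last iterate's training error; taking the feature map to be the identity, $\phi(X) = X$, is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 20
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star                 # noiseless labels: <theta_*, x_i> = y_i holds exactly

w = np.zeros(d)
eta = 0.01
for _ in range(20 * n):            # 20 passes of single-sample SGD
    i = rng.integers(n)
    w -= eta * (X[i] @ w - Y[i]) * X[i]

print("last-iterate training MSE:", np.mean((X @ w - Y) ** 2))
```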
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad class of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale nonconvex optimization problems are ubiquitous in modern machine learning.
We perform experiments on the effect of a wide array of minibatch sizes on Stochastic Gradient Descent (SGD) for synthetic problems.
arXiv Detail & Related papers (2020-02-09T09:56:06Z)