Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for
Full-Batch GD
- URL: http://arxiv.org/abs/2204.12446v1
- Date: Tue, 26 Apr 2022 17:05:57 GMT
- Title: Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for
Full-Batch GD
- Authors: Konstantinos E. Nikolakakis, Farzin Haddadpour, Amin Karbasi,
Dionysios S. Kalogerias
- Abstract summary: We provide sharp path-dependent and excess error guarantees for full-batch Gradient Descent (GD) with smooth (possibly non-Lipschitz) losses.
Our full-batch generalization error and excess risk bounds are significantly tighter than the existing bounds for GD when the loss is smooth (but possibly non-Lipschitz).
- Score: 31.80268332522017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We provide sharp path-dependent generalization and excess error guarantees
for the full-batch Gradient Descent (GD) algorithm for smooth losses (possibly
non-Lipschitz, possibly nonconvex). At the heart of our analysis is a novel
generalization error technique for deterministic symmetric algorithms, which
shows that average output stability together with a bounded expected gradient
of the loss at termination implies generalization. This key result shows that small
generalization error occurs at stationary points, and allows us to bypass
Lipschitz assumptions on the loss prevalent in previous work. For nonconvex,
convex and strongly convex losses, we show the explicit dependence of the
generalization error in terms of the accumulated path-dependent optimization
error, terminal optimization error, number of samples, and number of
iterations. For nonconvex smooth losses, we prove that full-batch GD
efficiently generalizes close to any stationary point at termination, under the
proper choice of a decreasing step size. Further, if the loss is nonconvex but
the objective satisfies the Polyak-Łojasiewicz (PL) condition, we derive vanishing
bounds on the corresponding excess risk. For convex and strongly convex smooth
losses, we prove that full-batch GD
generalizes even for large constant step sizes, and achieves a small excess
risk while training fast. Our full-batch GD generalization error and excess
risk bounds are significantly tighter than the existing bounds for (stochastic)
GD, when the loss is smooth (but possibly non-Lipschitz).
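To make the analyzed setting concrete, the following minimal sketch runs full-batch GD with a decreasing step size on a smooth (non-Lipschitz) least-squares loss and records the terminal gradient norm together with the empirical train/test gap. The data model, the quadratic loss, and the eta0/sqrt(t+1) schedule are illustrative assumptions for this sketch, not constructions taken from the paper.

```python
# Minimal sketch (assumptions noted above): full-batch gradient descent with a
# decreasing step size on a smooth least-squares loss, whose gradient is not
# uniformly bounded (i.e. the loss is not Lipschitz). At termination we record
# the gradient norm (approximate stationarity) and the empirical train/test gap.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20                                   # samples and dimension (toy sizes)
w_star = rng.normal(size=d)                      # ground-truth parameter
X_train = rng.normal(size=(n, d))
y_train = X_train @ w_star + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(n, d))
y_test = X_test @ w_star + 0.1 * rng.normal(size=n)

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return X.T @ (X @ w - y) / X.shape[0]

w = np.zeros(d)
T, eta0 = 500, 0.1
for t in range(T):
    eta_t = eta0 / np.sqrt(t + 1)                # decreasing step size (illustrative schedule)
    w -= eta_t * grad(w, X_train, y_train)       # full-batch step: uses all n training samples

grad_norm = np.linalg.norm(grad(w, X_train, y_train))
gen_gap = abs(loss(w, X_test, y_test) - loss(w, X_train, y_train))
print(f"terminal gradient norm: {grad_norm:.4f}, empirical train/test gap: {gen_gap:.4f}")
```

The quantities printed at termination mirror the structure of the guarantees above: a small expected gradient at the last iterate (approximate stationarity), together with average output stability, is what drives a small generalization error, with no Lipschitz assumption on the loss.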
Related papers
- Estimating Generalization Performance Along the Trajectory of Proximal SGD in Robust Regression [4.150180443030652]
We introduce estimators that precisely track the generalization error of the iterates along the trajectory of the iterative algorithm.
The results are illustrated through several examples, including Huber regression, pseudo-Huber regression, and their penalized variants with non-smooth regularizer.
arXiv Detail & Related papers (2024-10-03T16:13:42Z) - Select without Fear: Almost All Mini-Batch Schedules Generalize
Optimally [38.3493773521059]
We establish matching upper and lower generalization error bounds for GD with batch selection rules that are either deterministic or otherwise independent of the data.
For smooth losses, we show that full-batch (deterministic) GD is essentially optimal among all possible batch schedules.
arXiv Detail & Related papers (2023-05-03T16:32:30Z) - A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized
Linear Models [33.36787620121057]
We prove a new generalization bound that holds for any class of linear predictors in Gaussian space.
We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021).
We show that applying our bound via the localized Gaussian width yields bounds that are generally sharp for empirical risk minimizers.
arXiv Detail & Related papers (2022-10-21T16:16:55Z) - Private Stochastic Optimization With Large Worst-Case Lipschitz Parameter [14.040676498310198]
We study differentially private (DP) stochastic optimization (SO) with loss functions whose worst-case Lipschitz parameter over all data may be infinite.
For smooth loss functions, we provide linear-time algorithms with state-of-the-art excess risk.
We also provide the first algorithm to handle non-smooth convex loss functions.
arXiv Detail & Related papers (2022-09-15T16:03:23Z) - Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, up to an error neighborhood.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Gradient descent follows the regularization path for general losses [33.155195855431344]
We show that for empirical risk minimization over linear predictors with arbitrary convex losses, the gradient-descent path and the algorithm-independent regularization path converge to the same direction.
We provide a justification for the widely-used exponentially-tailed losses.
arXiv Detail & Related papers (2020-06-19T17:01:25Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z) - Approximation Schemes for ReLU Regression [80.33702497406632]
We consider the fundamental problem of ReLU regression.
The goal is to output the best fitting ReLU with respect to square loss, given draws from some unknown distribution (a minimal fitting sketch follows this list).
arXiv Detail & Related papers (2020-05-26T16:26:17Z)
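As a rough illustration of the ReLU regression objective in the last entry, fitting the best single ReLU under square loss from samples, here is a small sketch that runs plain gradient descent on that objective; the Gaussian data model, initialization, and step size are assumptions made for this example, and plain GD is not the approximation scheme proposed in that paper.

```python
# Sketch of the ReLU regression objective: fit y ~ relu(<w, x>) under square loss.
# Plain gradient descent is used only for illustration; it is not the
# approximation scheme studied in the cited paper.
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 10
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.maximum(X @ w_true, 0.0) + 0.05 * rng.normal(size=n)  # noisy ReLU labels

def square_loss(w):
    return 0.5 * np.mean((np.maximum(X @ w, 0.0) - y) ** 2)

def grad(w):
    z = X @ w
    # (Sub)gradient of the square loss composed with the ReLU; the derivative
    # of the ReLU is taken to be 0 at z = 0.
    return X.T @ ((np.maximum(z, 0.0) - y) * (z > 0)) / n

w = 0.01 * rng.normal(size=d)        # small random initialization
for _ in range(2000):
    w -= 0.1 * grad(w)               # fixed step size, chosen for this toy problem

print(f"final square loss: {square_loss(w):.5f}")
```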