Stability vs Implicit Bias of Gradient Methods on Separable Data and
Beyond
- URL: http://arxiv.org/abs/2202.13441v1
- Date: Sun, 27 Feb 2022 19:56:36 GMT
- Title: Stability vs Implicit Bias of Gradient Methods on Separable Data and
Beyond
- Authors: Matan Schliserman and Tomer Koren
- Abstract summary: We focus on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification.
We give an additional unified explanation for this generalization, relating it to two properties of the optimization objective that we refer to as realizability and self-boundedness.
In some of these cases, our bounds significantly improve upon the existing generalization error bounds in the literature.
- Score: 33.593203156666746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An influential line of recent work has focused on the generalization
properties of unregularized gradient-based learning procedures applied to
separable linear classification with exponentially-tailed loss functions. The
ability of such methods to generalize well has been attributed to their
implicit bias towards large margin predictors, both asymptotically as well as
in finite time. We give an additional unified explanation for this
generalization and relate it to two simple properties of the optimization
objective, that we refer to as realizability and self-boundedness. We introduce
a general setting of unconstrained stochastic convex optimization with these
properties, and analyze generalization of gradient methods through the lens of
algorithmic stability. In this broader setting, we obtain sharp stability
bounds for gradient descent and stochastic gradient descent which apply even
for a very large number of gradient steps, and use them to derive general
generalization bounds for these algorithms. Finally, as direct applications of
the general bounds, we return to the setting of linear classification with
separable data and establish several novel test loss and test accuracy bounds
for gradient descent and stochastic gradient descent for a variety of loss
functions with different tail decay rates. In some of these cases, our bounds
significantly improve upon the existing generalization error bounds in the
literature.
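As a concrete illustration of the setting described in the abstract, the following minimal sketch (ours, not the authors' code) runs unregularized gradient descent on synthetic linearly separable data with the logistic loss; the data generation, step size, and the loss_and_grad helper are all illustrative assumptions. The training loss tending to zero reflects realizability, and the printed ratio ||grad||/loss staying bounded reflects self-boundedness (for the logistic loss, |l'(t)| <= l(t) pointwise).

```python
import numpy as np

# Minimal sketch: gradient descent on separable data with logistic loss.
# Illustrates two properties named in the abstract:
#   * realizability:    the training objective can be driven to 0;
#   * self-boundedness: the gradient norm is controlled by the loss value.

rng = np.random.default_rng(0)

n, d = 200, 5
w_star = rng.normal(size=d)          # ground-truth separator
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)              # labels make the data separable

def loss_and_grad(w):
    margins = y * (X @ w)
    # log(1 + exp(-m)), computed stably via logaddexp
    loss = np.logaddexp(0.0, -margins).mean()
    # |l'(m)| = sigmoid(-m), computed stably as exp(-log(1 + exp(m)))
    p = np.exp(-np.logaddexp(0.0, margins))
    grad = -(X * (y * p)[:, None]).mean(axis=0)
    return loss, grad

w = np.zeros(d)
eta = 0.5                            # safely below 2/beta for this objective
for t in range(1, 10**5 + 1):
    loss, grad = loss_and_grad(w)
    w -= eta * grad
    if t in (1, 10, 100, 1000, 10**4, 10**5):
        gnorm = np.linalg.norm(grad)
        print(f"t={t:>6}  loss={loss:.3e}  ||grad||={gnorm:.3e}  "
              f"ratio={gnorm / loss:.2f}")
```

Because |l'(t)| <= l(t) for the logistic loss, the gradient norm shrinks together with the loss even after very many steps, which is the mechanism that makes stability bounds possible at long horizons.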
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of the online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms (a condition recalled after this list).
We also provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Non-Uniform Smoothness for Gradient Descent [5.64297382055816]
We introduce a local first-order smoothness oracle (LFSO) which generalizes the Lipschitz continuous gradient smoothness condition.
We show that this oracle can encode all relevant problem information for tuning stepsizes for a suitably modified gradient descent method.
We also show that LFSOs in this modified first-order method can yield global linear convergence rates for non-strongly convex problems with extremely flat minima.
arXiv Detail & Related papers (2023-11-15T00:44:08Z) - The Inductive Bias of Flatness Regularization for Deep Matrix
Factorization [58.851514333119255]
This work takes the first step toward understanding the inductive bias of the minimum trace of the Hessian solutions in deep linear networks.
We show that for all depths greater than one, under the standard Restricted Isometry Property (RIP) on the measurements, minimizing the trace of the Hessian is approximately equivalent to minimizing the Schatten 1-norm of the corresponding end-to-end matrix parameters.
arXiv Detail & Related papers (2023-06-22T23:14:57Z) - Convex and Non-convex Optimization Under Generalized Smoothness [69.69521650503431]
Classical analyses of convex and non-convex optimization methods often require the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics.
Recent work relaxes this requirement to a non-uniform smoothness condition that allows the smoothness to grow with the gradient norm (as recalled after this list).
arXiv Detail & Related papers (2023-06-02T04:21:59Z) - Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector
Problems [98.34292831923335]
Motivated by the problem of online correlation analysis, we propose the Stochastic Scaled-Gradient Descent (SSD) algorithm.
We bring these ideas together in an application to online correlation analysis, deriving for the first time an optimal one-time-scale algorithm with an explicit rate of local convergence to normality.
arXiv Detail & Related papers (2021-12-29T18:46:52Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high probability with best-known rates for smooth losses.
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting empirically.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD in comparison to ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Stochasticity helps to navigate rough landscapes: comparing
gradient-descent-based algorithms in the phase retrieval problem [8.164433158925593]
We investigate how gradient-descent-based algorithms such as gradient descent, its persistent variant, and Langevin dynamics navigate rough loss landscapes.
We apply dynamical mean-field theory from statistical physics to characterize the full trajectories of these algorithms analytically, in their continuous-time limit, with a warm start, and for large system sizes.
arXiv Detail & Related papers (2021-03-08T17:06:18Z) - Exact Linear Convergence Rate Analysis for Low-Rank Symmetric Matrix
Completion via Gradient Descent [22.851500417035947]
Factorization-based gradient descent is a scalable and efficient algorithm for solving low-rank matrix completion.
We show that gradient descent enjoys fast linear convergence to the global optimum of this nonconvex problem.
arXiv Detail & Related papers (2021-02-04T03:41:54Z) - Information-Theoretic Generalization Bounds for Stochastic Gradient
Descent [13.757095663704858]
We provide bounds on the generalization error that depend on local statistics of the stochastic gradients.
Key factors are the variance of the gradients with respect to the data distribution, the local smoothness of the objective, and the sensitivity of the loss function to perturbations.
Our key tool is combining the information-theoretic generalization bounds previously used for analyzing randomized variants of SGD with a perturbation analysis of the iterates.
arXiv Detail & Related papers (2021-02-01T16:00:34Z) - Implicit Gradient Regularization [18.391141066502644]
Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization.
We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization (the resulting modified loss is recalled after this list).
arXiv Detail & Related papers (2020-09-23T14:17:53Z)
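Two of the entries above ("Gradient-Variation Online Learning under Generalized Smoothness" and "Convex and Non-convex Optimization Under Generalized Smoothness") rest on the same relaxation of Lipschitz gradients. For reference, a standard formulation of this condition, the (L0, L1)-smoothness of Zhang et al., is the following (our paraphrase, not quoted from these abstracts):

```latex
% (L_0, L_1)-smoothness: the Hessian norm may grow affinely with the
% gradient norm; the classical Lipschitz-gradient case is L_1 = 0.
\[
  \|\nabla^2 f(x)\| \;\le\; L_0 + L_1 \,\|\nabla f(x)\| .
\]
```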
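For the Implicit Gradient Regularization entry, the backward-error-analysis result can be stated compactly (the standard leading-order form due to Barrett and Dherin, not quoted from the abstract): gradient descent with step size h approximately follows the gradient flow of a modified loss whose implicit penalty is the squared gradient norm.

```latex
% Modified loss implicitly minimized by gradient descent with step size h,
% to leading order in h; the penalty term is the implicit regularizer.
\[
  \tilde{L}(\theta) \;=\; L(\theta) \;+\; \frac{h}{4}\,\bigl\|\nabla L(\theta)\bigr\|^2 .
\]
```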