Stability of SGD: Tightness Analysis and Improved Bounds
- URL: http://arxiv.org/abs/2102.05274v1
- Date: Wed, 10 Feb 2021 05:43:27 GMT
- Title: Stability of SGD: Tightness Analysis and Improved Bounds
- Authors: Yikai Zhang, Wenjia Zhang, Sammy Bald, Vamsi Pingali, Chao Chen,
Mayank Goswami
- Abstract summary: Gradient Descent (SGD) based methods have been widely used for training largescale machine learning models that generalize well in practice.
This paper addresses the question: is the stability analysis of [18] tight for smooth functions, and if not, for what kind of loss functions and data distributions can the analysis be improved?
- Score: 8.831597193643628
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Stochastic Gradient Descent (SGD) based methods have been widely used for
training large-scale machine learning models that also generalize well in
practice. Several explanations have been offered for this generalization
performance, a prominent one being algorithmic stability [18]. However, there
are no known examples of smooth loss functions for which the analysis can be
shown to be tight. Furthermore, apart from the properties of the loss function,
data distribution has also been shown to be an important factor in
generalization performance. This raises the question: is the stability analysis
of [18] tight for smooth functions, and if not, for what kind of loss functions
and data distributions can the stability analysis be improved? In this paper we
first settle open questions regarding tightness of bounds in the
data-independent setting: we show that for general datasets, the existing
analysis for convex and strongly-convex loss functions is tight, but it can be
improved for non-convex loss functions. Next, we give novel and improved
data-dependent bounds: we show stability upper bounds for a large class of
convex regularized loss functions, with negligible regularization parameters,
and improve existing data-dependent bounds in the non-convex setting. We hope
that our results will initiate further efforts to better understand the
data-dependent setting under non-convex loss functions, leading to an improved
understanding of the generalization abilities of deep networks.
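
The stability framework of [18] bounds the generalization gap of SGD by how little its output changes when a single training example is replaced. The sketch below is a minimal illustration of that idea, not code from the paper: it couples two SGD runs on neighboring datasets (same seed, same sampling order) for a smooth convex least-squares loss and reports their parameter divergence. All data, loss, and hyperparameter choices here are hypothetical.

```python
# Illustrative sketch of algorithmic stability for SGD (in the spirit of [18]).
# Train on two datasets that differ in a single example and measure ||w - w'||.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Neighboring dataset S': replace one example (index 0) with a fresh draw.
X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d)
y2[0] = X2[0] @ rng.normal(size=d)

def sgd(Xs, ys, steps=2000, lr=0.01, seed=1):
    """Plain SGD on the smooth convex least-squares loss 0.5 * (x^T w - y)^2."""
    r = np.random.default_rng(seed)   # same seed => identical sampling order in both runs
    w = np.zeros(Xs.shape[1])
    for _ in range(steps):
        i = r.integers(len(ys))       # uniformly sampled example
        grad = (Xs[i] @ w - ys[i]) * Xs[i]
        w -= lr * grad
    return w

w, w2 = sgd(X, y), sgd(X2, y2)
# For an L-Lipschitz loss, uniform stability bounds the (expected) generalization
# gap by roughly L * ||w - w'||, so small divergence indicates a stable run.
print("parameter divergence ||w - w'|| =", np.linalg.norm(w - w2))
```

Repeating this experiment over many replaced indices and seeds gives an empirical proxy for the uniform-stability quantity that the data-independent and data-dependent bounds in the paper control analytically.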
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - LEARN: An Invex Loss for Outlier Oblivious Robust Online Optimization [56.67706781191521]
We present a robust online optimization framework in which an adversary can introduce outliers by corrupting the loss functions in an arbitrary number of rounds k, unknown to the learner.
arXiv Detail & Related papers (2024-08-12T17:08:31Z) - A Precise Characterization of SGD Stability Using Loss Surface Geometry [8.942671556572073]
Stochastic Gradient Descent (SGD) stands as a cornerstone optimization algorithm with proven real-world empirical success but relatively limited theoretical understanding.
Recent research has illuminated a key factor contributing to its practical efficacy: the implicit regularization it instigates.
arXiv Detail & Related papers (2024-01-22T19:46:30Z) - Time-Independent Information-Theoretic Generalization Bounds for SGLD [4.73194777046253]
We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD).
Our bounds are based on smoothness and dissipativity assumptions, and are non-exponential.
arXiv Detail & Related papers (2023-11-02T07:42:23Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We give a general sample complexity analysis of "vanilla" PG, whose theoretical understanding remains limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Optimal Rates for Random Order Online Optimization [60.011653053877126]
We study the random order online optimization model of Garber et al. (2020), where the loss functions may be chosen by an adversary but are then presented online in a uniformly random order.
We show that algorithms in this model achieve the optimal bounds and significantly improved stability.
arXiv Detail & Related papers (2021-06-29T09:48:46Z) - The Sobolev Regularization Effect of Stochastic Gradient Descent [8.193914488276468]
We show that flat minima regularize the gradient of the model function, which explains the good performance of flat minima.
We also consider high-order moments of the gradient noise, and show through a linear analysis around global minima that Stochastic Gradient Descent (SGD) tends to impose constraints on these moments.
arXiv Detail & Related papers (2021-05-27T21:49:21Z) - SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and
Interpolation [17.199023009789308]
Stochastic Gradient Descent (SGD) is routinely used for optimizing non-convex functions.
In this paper, we show convergence guarantees for SGD in the smooth non-convex setting.
We also provide theoretical guarantees under different step-size conditions.
arXiv Detail & Related papers (2020-06-18T07:05:56Z) - Fine-Grained Analysis of Stability and Generalization for Stochastic
Gradient Descent [55.85456985750134]
We introduce a new stability measure called on-average model stability, for which we develop novel bounds controlled by the risks of SGD iterates.
This yields generalization bounds depending on the behavior of the best model, and leads to the first known fast bounds in the low-noise setting.
To the best of our knowledge, this gives the first known stability and generalization bounds for SGD with even non-differentiable loss functions.
arXiv Detail & Related papers (2020-06-15T06:30:19Z) - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses [52.039438701530905]
We provide sharp upper and lower bounds for several forms of stochastic gradient descent (SGD) on arbitrary Lipschitz nonsmooth convex losses.
Our bounds allow us to derive a new algorithm for differentially private nonsmooth convex optimization with optimal excess population risk.
arXiv Detail & Related papers (2020-06-12T02:45:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.