What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
- URL: http://arxiv.org/abs/2110.06914v1
- Date: Wed, 13 Oct 2021 17:50:46 GMT
- Title: What Happens after SGD Reaches Zero Loss? --A Mathematical Framework
- Authors: Zhiyuan Li, Tianhao Wang, Sanjeev Arora
- Abstract summary: Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
- Score: 35.31946061894308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one
of the key challenges in deep learning, especially for overparametrized models,
where the local minimizers of the loss function $L$ can form a manifold.
Intuitively, with a sufficiently small learning rate $\eta$, SGD tracks
Gradient Descent (GD) until it gets close to such manifold, where the gradient
noise prevents further convergence. In such a regime, Blanc et al. (2020)
proved that SGD with label noise locally decreases a regularizer-like term, the
sharpness of loss, $\mathrm{tr}[\nabla^2 L]$. The current paper gives a general
framework for such analysis by adapting ideas from Katzenberger (1991). It
allows in principle a complete characterization for the regularization effect
of SGD around such manifold -- i.e., the "implicit bias" -- using a stochastic
differential equation (SDE) describing the limiting dynamics of the parameters,
which is determined jointly by the loss function and the noise covariance. This
yields some new results: (1) a global analysis of the implicit bias valid for
$\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020)
that is only valid for $\eta^{-1.6}$ steps and (2) allowing arbitrary noise
covariance. As an application, we show that with arbitrarily large initialization,
label noise SGD can always escape the kernel regime and only requires
$O(\kappa\ln d)$ samples for learning a $\kappa$-sparse overparametrized
linear model in $\mathbb{R}^d$ (Woodworth et al., 2020), while GD initialized
in the kernel regime requires $\Omega(d)$ samples. This upper bound is minimax
optimal and improves the previous $\tilde{O}(\kappa^2)$ upper bound (HaoChen et
al., 2020).
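The dynamics described in the abstract can be illustrated numerically. Below is a minimal NumPy sketch (all sizes, step sizes, and noise levels are illustrative choices, not values from the paper) that runs label-noise SGD on a quadratically overparametrized linear model $w = u \odot u$ in the spirit of Woodworth et al. (2020), and computes the sharpness $\mathrm{tr}[\nabla^2 L]$ that the analysis identifies as the implicit regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overparametrized linear regression with the u ⊙ u parameterization.
# Sizes are illustrative choices, not from the paper.
d, n = 10, 8
w_star = np.zeros(d)
w_star[0] = 1.0                       # 1-sparse ground truth
X = rng.standard_normal((n, d))
y = X @ w_star

def loss(u):
    r = X @ (u * u) - y
    return 0.5 * np.mean(r * r)       # L(u) = (1/2n) ||X(u ⊙ u) - y||²

def sharpness(u):
    # Closed-form tr[∇²L](u).  On the zero-loss manifold the residual term
    # vanishes, leaving (4/n) Σ_ij x_ij² u_j² -- the regularizer that
    # label-noise SGD implicitly decreases.
    r = X @ (u * u) - y
    return (4.0 / n) * np.sum((X ** 2) * (u ** 2)) + (2.0 / n) * (r @ X).sum()

# Label-noise SGD: each step perturbs the sampled label with Gaussian noise.
eta, sigma = 5e-3, 0.3
u = np.full(d, 0.7)
for _ in range(40_000):
    i = rng.integers(n)
    r_i = X[i] @ (u * u) - (y[i] + sigma * rng.standard_normal())
    u -= eta * 2.0 * r_i * X[i] * u   # per-sample gradient w.r.t. u

final_loss, final_sharpness = loss(u), sharpness(u)
```

On the $\eta^{-2}$ timescale the theory predicts that the iterates hover near the zero-loss manifold while drifting toward points of lower sharpness; the closed-form `sharpness` above matches a finite-difference Hessian trace.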
Related papers
- A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $\mathbf{x}^*$ rather than for all $\mathbf{x}^*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Horizon-Free and Variance-Dependent Reinforcement Learning for Latent
Markov Decision Processes [62.90204655228324]
We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight.
We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver.
arXiv Detail & Related papers (2022-10-20T21:32:01Z) - Topology-aware Generalization of Decentralized SGD [89.25765221779288]
This paper studies the generalizability of decentralized stochastic gradient descent (D-SGD).
We prove that the generalizability of D-SGD has a positive correlation with connectivity in the initial training phase.
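For intuition about the algorithm, here is a hypothetical toy instance of D-SGD (ring topology, local quadratic objectives, all constants illustrative, not the paper's construction): each node takes a gradient step on its own objective and then gossip-averages its parameters with its neighbors through a doubly stochastic mixing matrix.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy: m nodes on a ring minimize the average of local
# quadratics f_k(w) = 0.5 ||w - a_k||²; the global minimizer is mean(a_k).
m, d, eta, steps = 8, 3, 0.1, 400
a = rng.standard_normal((m, d))       # each node's local target

# Doubly stochastic ring mixing matrix: average self with two neighbors.
W = np.zeros((m, m))
for k in range(m):
    W[k, k] = W[k, (k - 1) % m] = W[k, (k + 1) % m] = 1.0 / 3.0

w = rng.standard_normal((m, d))       # one parameter vector per node
for _ in range(steps):
    grads = w - a                     # ∇f_k(w_k) for the local quadratics
    w = W @ (w - eta * grads)         # local step, then gossip averaging

consensus_gap = np.abs(w - w.mean(axis=0)).max()  # node disagreement
error = np.abs(w.mean(axis=0) - a.mean(axis=0)).max()
```

Denser connectivity (a mixing matrix with smaller second eigenvalue) shrinks the consensus gap faster; with a constant step size a residual gap proportional to the data heterogeneity remains.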
arXiv Detail & Related papers (2022-06-25T16:03:48Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the proposed method so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of optimal bounds in the smooth nonconvex regime.
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - Generalization Bounds for Gradient Methods via Discrete and Continuous
Prior [8.76346911214414]
We show a new high-probability generalization bound of order $O(\frac{1}{n} + \frac{L^2}{n^2}\sum_{t=1}^{T}(\gamma_t/\varepsilon_t)^2)$ for Gradient Langevin Dynamics (GLD)
We can also obtain new bounds for certain variants of SGD.
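For reference, Gradient Langevin Dynamics is gradient descent with injected isotropic Gaussian noise, whose stationary law is the Gibbs measure $\propto e^{-\beta L}$. A self-contained sketch on a quadratic (all constants are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# GLD on L(w) = 0.5 ||w - a||²:  w ← w - η ∇L(w) + sqrt(2η/β) ξ,  ξ ~ N(0, I).
# Its stationary distribution is ∝ exp(-β L(w)), here N(a, I/β).
d, eta, beta, steps = 4, 1e-2, 10.0, 60_000
a = np.array([1.0, -2.0, 0.5, 3.0])   # illustrative minimizer

w = np.zeros(d)
samples = np.empty((steps, d))
for t in range(steps):
    grad = w - a
    w = w - eta * grad + np.sqrt(2.0 * eta / beta) * rng.standard_normal(d)
    samples[t] = w

burn = steps // 2                     # discard transient before estimating
mean_est = samples[burn:].mean(axis=0)
var_est = samples[burn:].var(axis=0)  # should approach 1/β per coordinate
```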
arXiv Detail & Related papers (2022-05-27T07:23:01Z) - Understanding Gradient Descent on Edge of Stability in Deep Learning [32.03036040349019]
This paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss.
The above theoretical results have been corroborated by an experimental study.
arXiv Detail & Related papers (2022-05-19T17:57:01Z) - A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm [3.2958527541557525]
Such problems arise frequently in machine learning in the context of robust empirical risk minimization.
We consider the accelerated primal dual (SAPD) algorithm as a robust method against gradient noise.
We show that our method improves upon SAPD both in practice and in theory.
arXiv Detail & Related papers (2022-02-19T22:12:30Z) - Black-Box Generalization [31.80268332522017]
We provide the first generalization error analysis for black-box learning through derivative-free optimization.
We show that both generalization bounds are independent of $d$ and $K$, under appropriate choices of a slightly decreased learning rate.
arXiv Detail & Related papers (2022-02-14T17:14:48Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
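The asynchronous protocol updates one (state, action) entry per step along a single Markovian trajectory from a behavior policy. A toy deterministic MDP sketch (the MDP, step size, and horizon are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 2-state, 2-action deterministic MDP: the action chooses the
# next state, and reward 1 is paid only for taking action 1 in state 0.
n_s, n_a, gamma = 2, 2, 0.9

def env_step(s, a):
    return (1.0 if (s == 0 and a == 1) else 0.0), a   # (reward, next state)

# Ground-truth Q* via value iteration, for comparison.
Q_star = np.zeros((n_s, n_a))
for _ in range(2000):
    for s in range(n_s):
        for act in range(n_a):
            r, s2 = env_step(s, act)
            Q_star[s, act] = r + gamma * Q_star[s2].max()

# Asynchronous Q-learning along one trajectory with a uniform behavior policy.
Q = np.zeros((n_s, n_a))
alpha, s = 0.2, 0
for _ in range(50_000):
    act = rng.integers(n_a)           # behavior policy: uniform random action
    r, s2 = env_step(s, act)
    Q[s, act] += alpha * (r + gamma * Q[s2].max() - Q[s, act])
    s = s2                            # continue the same Markovian trajectory
```

Because only the visited entry is updated, the cited rate depends on $\mu_{\min}$, the minimum state-action occupancy of the behavior policy, and on the trajectory's mixing time $t_{\mathrm{mix}}$.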
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2})+\epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
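The realizable special case of this setting is easy to simulate with plain full-batch gradient descent. A hypothetical instance with illustrative sizes (this heuristic sketch does not reproduce the paper's agnostic guarantee):

```python
import numpy as np

rng = np.random.default_rng(4)

# Realizable toy: labels come from a planted neuron ReLU(w*·x).
d, n, lr, steps = 5, 200, 0.1, 2000
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = np.maximum(X @ w_star, 0.0)

def loss(w):
    return 0.5 * np.mean((np.maximum(X @ w, 0.0) - y) ** 2)

w = 0.1 * rng.standard_normal(d)      # small random initialization
init_loss = loss(w)
for _ in range(steps):
    z = X @ w
    pred = np.maximum(z, 0.0)
    # (Sub)gradient of the empirical square loss for a single ReLU neuron:
    # only samples with an active unit (z > 0) contribute.
    grad = X.T @ ((pred - y) * (z > 0)) / n
    w -= lr * grad
```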
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.