Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD
with Momentum
- URL: http://arxiv.org/abs/2203.11992v1
- Date: Tue, 22 Mar 2022 18:38:13 GMT
- Title: Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD
with Momentum
- Authors: Kirby Banman, Liam Peet-Pare, Nidhi Hegde, Alona Fyshe, Martha White
- Abstract summary: Existing work has shown that SGDm with a decaying step-size can converge under Markovian temporal correlation.
In this work, we show that SGDm under covariate shift with a fixed step-size can be unstable and diverge.
We approximate the learning system as a time varying system of ordinary differential equations, and leverage existing theory to characterize the system's divergence/convergence as resonant/nonresonant modes.
- Score: 26.25434025410027
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Most convergence guarantees for stochastic gradient descent with momentum
(SGDm) rely on iid sampling. Yet, SGDm is often used outside this regime, in
settings with temporally correlated input samples such as continual learning
and reinforcement learning. Existing work has shown that SGDm with a decaying
step-size can converge under Markovian temporal correlation. In this work, we
show that SGDm under covariate shift with a fixed step-size can be unstable and
diverge. In particular, we show SGDm under covariate shift is a parametric
oscillator, and so can suffer from a phenomenon known as resonance. We
approximate the learning system as a time varying system of ordinary
differential equations, and leverage existing theory to characterize the
system's divergence/convergence as resonant/nonresonant modes. The theoretical
result is limited to the linear setting with periodic covariate shift, so we
empirically supplement this result to show that resonance phenomena persist
even under non-periodic covariate shift, nonlinear dynamics with neural
networks, and optimizers other than SGDm.
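To make the parametric-oscillator picture concrete, here is a rough sketch (under simplifying assumptions, not the paper's exact derivation): for linear regression under periodic covariate shift, the expected loss is quadratic with a time-varying curvature, and a standard continuous-time approximation of the heavy-ball update yields a damped second-order ODE with periodically modulated stiffness.

```latex
% Sketch: SGDm (heavy-ball) with step-size \alpha and momentum \beta
\[
  w_{k+1} = w_k + \beta\,(w_k - w_{k-1}) - \alpha\,\nabla f_k(w_k)
\]
% For linear regression the expected loss is quadratic with time-varying
% curvature A_k = \mathbb{E}[x_k x_k^\top]; a continuous-time approximation is
\[
  \ddot{w}(t) + c\,\dot{w}(t) + A(t)\,\bigl(w(t) - w^\ast\bigr) = 0,
  \qquad A(t+T) = A(t),
\]
% a damped Mathieu-type (parametric) oscillator: at resonant modulation
% frequencies its solutions can grow without bound despite the damping c > 0.
```

The effect is easy to probe numerically. The snippet below is a minimal, self-contained sketch (not the authors' code; the step-size, momentum, shift amplitude, and frequencies are illustrative) that runs heavy-ball SGD on one-dimensional linear regression while the input scale oscillates periodically, sweeping the shift frequency to compare how the tracking error evolves.

```python
import numpy as np

def sgdm_under_periodic_shift(alpha=0.01, beta=0.9, drive_freq=0.05,
                              shift_amp=0.9, steps=20000, seed=0):
    """Heavy-ball SGD on 1-D linear regression y = w* x + noise, where the
    input scale oscillates periodically (a simple form of covariate shift).
    Returns the trajectory of |w_k - w*|."""
    rng = np.random.default_rng(seed)
    w_star, w, w_prev = 1.0, 0.0, 0.0
    errs = np.empty(steps)
    for k in range(steps):
        # Periodically modulated input distribution: x_k ~ N(0, sigma_k^2).
        sigma_k = 1.0 + shift_amp * np.sin(2 * np.pi * drive_freq * k)
        x = sigma_k * rng.standard_normal()
        y = w_star * x + 0.01 * rng.standard_normal()
        grad = (w * x - y) * x              # gradient of 0.5 * (w x - y)^2
        # Heavy-ball update: w_{k+1} = w_k + beta (w_k - w_{k-1}) - alpha grad
        w, w_prev = w + beta * (w - w_prev) - alpha * grad, w
        errs[k] = abs(w - w_star)
    return errs

if __name__ == "__main__":
    for f in (0.002, 0.05, 0.2):            # sweep the covariate-shift frequency
        errs = sgdm_under_periodic_shift(drive_freq=f)
        print(f"drive_freq={f:>5}: final |w - w*| = {errs[-1]:.3e}, "
              f"max over run = {errs.max():.3e}")
```

Whether a given frequency excites growth depends jointly on the step-size, momentum, and shift amplitude, which is consistent with the abstract's characterization of divergence/convergence in terms of resonant/nonresonant modes.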
Related papers
- Probing Dynamical Sensitivity of a Non-KAM System Through
Out-of-Time-Order Correlators [0.0]
Non-KAM systems offer a fast route to classical chaos through an abrupt breaking of invariant phase space tori.
We employ out-of-time-order correlators (OTOCs) to study the dynamical sensitivity of a perturbed non-KAM system in the quantum limit.
Our findings suggest that the short-time dynamics remain relatively more stable and show the exponential growth found in the literature for unstable fixed points.
arXiv Detail & Related papers (2023-06-07T07:31:16Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore the parallels between neural network training and physical systems.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Doubly Stochastic Models: Learning with Unbiased Label Noises and
Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find that such an implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume SGD behaves as an Ornstein-Uhlenbeck-like process, and it achieves a more general result with a proof that the approximation converges.
arXiv Detail & Related papers (2023-04-01T14:09:07Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds for RF regression under both constant and adaptive step-size SGD settings.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - The Limiting Dynamics of SGD: Modified Loss, Phase Space Oscillations,
and Anomalous Diffusion [29.489737359897312]
We study the limiting dynamics of deep neural networks trained with stochastic gradient descent (SGD).
We show that the key ingredient driving these dynamics is not the original training loss, but rather the combination of a modified loss, which implicitly regularizes the velocity, and probability currents, which cause oscillations in phase space.
arXiv Detail & Related papers (2021-07-19T20:18:57Z) - Understanding Long Range Memory Effects in Deep Neural Networks [10.616643031188248]
Stochastic gradient descent (SGD) is of fundamental importance in deep learning.
In this study, we argue that the stochastic gradient noise (SGN) is neither Gaussian nor stable. Instead, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM).
arXiv Detail & Related papers (2021-05-05T13:54:26Z) - On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum values such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize.
arXiv Detail & Related papers (2021-02-26T18:58:29Z) - Time-Reversal Symmetric ODE Network [138.02741983098454]
Time-reversal symmetry is a fundamental property that frequently holds in classical and quantum mechanics.
We propose a novel loss function that measures how well our ordinary differential equation (ODE) networks comply with this time-reversal symmetry.
We show that, even for systems that do not possess the full time-reversal symmetry, TRS-ODENs can achieve better predictive performances over baselines.
arXiv Detail & Related papers (2020-07-22T12:19:40Z) - Signatures of quantum chaos transition in short spin chains [0.0]
The study of the long-time oscillations of the out-of-time-ordered correlator (OTOC) appears as a versatile tool, that can be adapted to the case of systems with a small number of degrees of freedom.
We show that the systematics of the OTOC oscillations describe well, in a chain with only 4 spins, the integrable-to-chaos transition inherited from the infinite chain.
arXiv Detail & Related papers (2020-04-29T19:13:58Z) - Sparse and Smooth: improved guarantees for Spectral Clustering in the
Dynamic Stochastic Block Model [12.538755088321404]
We analyse classical variants of the Spectral Clustering (SC) algorithm in the Dynamic Stochastic Block Model (DSBM).
Existing results show that, in the relatively sparse case where the expected degree grows logarithmically with the number of nodes, guarantees in the static case can be extended to the dynamic case.
We improve over these results by drawing a new link between the sparsity and the smoothness of the DSBM.
arXiv Detail & Related papers (2020-02-07T16:49:25Z) - On the Generalization of Stochastic Gradient Descent with Momentum [84.54924994010703]
Momentum-based accelerated variants of stochastic gradient descent (SGD) are widely used when training machine learning models.
We first show that there exists a convex loss function for which the stability gap for multiple epochs of SGD with standard heavy-ball momentum (SGDM) becomes unbounded.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, i.e., SGD with early momentum (SGDEM) under a broad range of step-sizes.
arXiv Detail & Related papers (2018-09-12T17:02:08Z)