Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning
and Autoregression
- URL: http://arxiv.org/abs/2310.11428v1
- Date: Tue, 17 Oct 2023 17:39:40 GMT
- Title: Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning
and Autoregression
- Authors: Adam Block, Dylan J. Foster, Akshay Krishnamurthy, Max Simchowitz,
Cyril Zhang
- Abstract summary: We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
- Score: 70.78523583702209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work studies training instabilities of behavior cloning with deep neural
networks. We observe that minibatch SGD updates to the policy network during
training result in sharp oscillations in long-horizon rewards, despite
negligibly affecting the behavior cloning loss. We empirically disentangle the
statistical and computational causes of these oscillations, and find them to
stem from the chaotic propagation of minibatch SGD noise through unstable
closed-loop dynamics. While SGD noise is benign in the single-step action
prediction objective, it results in catastrophic error accumulation over long
horizons, an effect we term gradient variance amplification (GVA). We show that
many standard mitigation techniques do not alleviate GVA, but find an
exponential moving average (EMA) of iterates to be surprisingly effective at
doing so. We illustrate the generality of this phenomenon by showing the
existence of GVA and its amelioration by EMA in both continuous control and
autoregressive language generation. Finally, we provide theoretical vignettes
that highlight the benefits of EMA in alleviating GVA and shed light on the
extent to which classical convex models can help in understanding the benefits
of iterate averaging in deep learning.
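The paper's remedy, an exponential moving average (EMA) of the SGD iterates, is simple to implement. Below is a minimal PyTorch-style sketch (not the authors' reference code; the class name, decay value, and training-loop helpers are illustrative): keep a shadow copy of the parameters, update it after every optimizer step, and roll out the policy with the shadow weights.

```python
import copy
import torch

class IterateEMA:
    """Exponential moving average of model parameters (iterate averaging).

    After each SGD step the shadow parameters are updated as
        ema <- decay * ema + (1 - decay) * param,
    so evaluation uses a smoothed iterate instead of the noisy last one.
    """

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = copy.deepcopy(model)  # holds the averaged weights
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module) -> None:
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# Usage sketch: call ema.update(model) after every optimizer.step(),
# then evaluate or roll out with ema.shadow rather than model.
```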
Related papers
- Per-Example Gradient Regularization Improves Learning Signals from Noisy
Data [25.646054298195434]
Empirical evidence suggests that gradient regularization techniques can significantly enhance the robustness of deep learning models against noisy perturbations.
We present a theoretical analysis demonstrating that per-example gradient regularization (PEGR) improves both test error and robustness against noise perturbations.
Our analysis reveals that PEGR penalizes the variance of pattern learning, thus effectively suppressing the memorization of noise in the training data (see the sketch after this entry).
arXiv Detail & Related papers (2023-03-31T10:08:23Z)
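One hedged reading of PEGR is to penalize per-example gradient norms so that no single (possibly noisy) example dominates the update. The sketch below illustrates that idea only; the function name, cross-entropy task, and weight `lam` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pegr_loss(model, x, y, lam: float = 0.1):
    """Average loss plus a per-example gradient-norm penalty.

    Illustrative reading of per-example gradient regularization (PEGR):
    penalizing the size of per-example gradients suppresses updates
    driven by memorizing individual noisy examples.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    penalties = []
    for xi, yi in zip(x, y):  # explicit loop for clarity; slow for big batches
        li = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        # create_graph=True so the penalty itself can be backpropagated
        gi = torch.autograd.grad(li, params, create_graph=True)
        penalties.append(sum(g.pow(2).sum() for g in gi))
    penalty = torch.stack(penalties).mean()
    base = F.cross_entropy(model(x), y)
    return base + lam * penalty
```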
- SGD with Large Step Sizes Learns Sparse Features [22.959258640051342]
We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks.
We show that the longer large step sizes keep SGD high in the loss landscape, the better its implicit regularization can operate to find sparse representations.
arXiv Detail & Related papers (2022-10-11T11:00:04Z)
- On the Generalization of Stochastic Gradient Descent with Momentum [58.900860437254885]
We first show that there exists a convex loss function for which algorithmic stability fails to establish generalization guarantees.
For smooth Lipschitz loss functions, we analyze a modified momentum-based update rule, and show that it admits an upper-bound on the generalization error.
For the special case of strongly convex loss functions, we find a range of momentum parameters such that multiple epochs of standard SGDM, as a special form of SGDEM, also generalize (the standard SGDM update is sketched after this entry).
arXiv Detail & Related papers (2021-02-26T18:58:29Z)
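For context, the textbook SGD-with-momentum (SGDM) update analyzed in this line of work is sketched below; the paper's modified rule (SGDEM) is not reproduced here.

```latex
v_{t+1} = \mu\, v_t + \nabla f(w_t; \xi_t), \qquad w_{t+1} = w_t - \eta\, v_{t+1}
```

Here $\mu \in [0, 1)$ is the momentum parameter, $\eta > 0$ the step size, and $\nabla f(w_t; \xi_t)$ the stochastic gradient on minibatch $\xi_t$.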
- Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections [73.95786440318369]
We focus on the so-called 'implicit effect' of Gaussian noise injections (GNIs), which is the effect of the injected noise on the dynamics of stochastic gradient descent (SGD).
We show that this effect induces an asymmetric heavy-tailed noise on gradient updates.
We then formally prove that GNIs induce an 'implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry (a minimal GNI layer is sketched after this entry).
arXiv Detail & Related papers (2021-02-13T21:28:09Z)
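A minimal sketch of one common form of Gaussian noise injection, additive noise on activations during training. The module name and `sigma` default are illustrative; the cited paper analyzes the implicit effect of such injections, not this particular layer.

```python
import torch

class GaussianNoiseInjection(torch.nn.Module):
    """Adds zero-mean Gaussian noise to activations while training."""

    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and self.sigma > 0:
            x = x + self.sigma * torch.randn_like(x)  # injected noise
        return x

# Usage sketch: interleave with ordinary layers.
# net = torch.nn.Sequential(
#     torch.nn.Linear(32, 64), GaussianNoiseInjection(0.1),
#     torch.nn.ReLU(), torch.nn.Linear(64, 10),
# )
```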
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large-eigenvalue directions of the data matrix, while GD follows the small-eigenvalue directions (see the worked example after this entry).
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
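To make the eigenvalue-direction statement concrete, here is the standard least-squares warm-up (an illustration, not the paper's exact setting). For full-batch GD on $L(w) = \frac{1}{2n}\lVert Xw - y \rVert^2$, write the error $e_t = w_t - w^*$ in the eigenbasis of $\frac{1}{n} X^\top X$ with eigenvalues $\lambda_i$; each component contracts independently:

```latex
e_{t+1}^{(i)} = (1 - \eta \lambda_i)\, e_t^{(i)}
```

This per-direction contraction is the basic mechanism; the cited paper characterizes how SGD noise at moderate learning rates changes which directions the iterate effectively converges along.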
- Dynamic of Stochastic Gradient Descent with State-Dependent Noise [84.64013284862733]
Stochastic gradient descent (SGD) and its variants are the mainstream methods used to train deep neural networks.
We show that the covariance of SGD noise in the local region around a local minimum is a quadratic function of the state (a worked one-dimensional example follows this entry).
We propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamics of SGD.
arXiv Detail & Related papers (2020-06-24T13:34:38Z)
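As a worked one-dimensional example of state-dependent noise (an illustration consistent with, but not taken from, the paper), consider interpolating least squares $f(w) = \frac{1}{2n}\sum_i (x_i w - y_i)^2$ with $x_i w^* = y_i$ for every $i$. The per-example gradient and its variance across examples are

```latex
g_i(w) = x_i (x_i w - y_i) = x_i^2\,(w - w^*), \qquad
\operatorname{Var}_i\big[g_i(w)\big] = \operatorname{Var}_i\big[x_i^2\big]\,(w - w^*)^2
```

so the gradient-noise covariance is a quadratic function of the state and vanishes exactly at the minimum.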
- The Break-Even Point on Optimization Trajectories of Deep Neural Networks [64.7563588124004]
We argue for the existence of a "break-even" point on the optimization trajectory.
We show that using a large learning rate in the initial phase of training reduces the variance of the gradient.
We also show that using a low learning rate results in bad conditioning of the loss surface even for a neural network with batch normalization layers.
arXiv Detail & Related papers (2020-02-21T22:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.