On the Overlooked Structure of Stochastic Gradients
- URL: http://arxiv.org/abs/2212.02083v3
- Date: Fri, 20 Oct 2023 11:29:13 GMT
- Title: On the Overlooked Structure of Stochastic Gradients
- Authors: Zeke Xie, Qian-Yuan Tang, Mingming Sun, Ping Li
- Abstract summary: We show that dimension-wise gradients usually exhibit power-law heavy tails, while iteration-wise gradients and gradient noise caused by minibatch training usually do not exhibit power-law heavy tails.
Our work challenges the existing belief and provides novel insights into the structure of stochastic gradients in deep learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradients closely relate to both optimization and generalization
of deep neural networks (DNNs). Some works have attempted to explain the success of
stochastic optimization for deep learning by the purportedly heavy-tailed properties
of gradient noise, while other works have presented theoretical and empirical
evidence against the heavy-tail hypothesis on gradient noise. Unfortunately,
formal statistical tests for analyzing the structure and heavy tails of
stochastic gradients in deep learning are still under-explored. In this paper,
we mainly make two contributions. First, we conduct formal statistical tests on
the distribution of stochastic gradients and gradient noise across both
parameters and iterations. Our statistical tests reveal that dimension-wise
gradients usually exhibit power-law heavy tails, while iteration-wise gradients
and stochastic gradient noise caused by minibatch training usually do not
exhibit power-law heavy tails. Second, we further discover that the covariance
spectra of stochastic gradients have power-law structures that previous studies
overlooked, and we present the theoretical implications of this structure for the
training of DNNs. While previous studies recognized that the anisotropic structure
of stochastic gradients matters to deep learning, they did not anticipate that the
gradient covariance could have such an elegant mathematical structure. Our work
challenges the existing belief and provides novel insights into the structure of
stochastic gradients in deep learning.
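Both empirical claims lend themselves to simple sanity checks on toy data. The sketch below is my own minimal illustration, not the authors' test pipeline: the synthetic t-distributed "gradients", the Hill-style maximum-likelihood exponent estimate, and the 90th-percentile tail cutoff are all assumptions made for brevity. It estimates a power-law tail exponent for dimension-wise gradient magnitudes and fits the log-log slope of the gradient covariance spectrum, which is approximately linear when the eigenvalues follow a power law lambda_k ∝ k^(-s).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for stochastic gradients: rows are minibatch gradient
# samples, columns are parameter dimensions (heavy-tailed by construction,
# purely for illustration; real data would come from minibatch backprop).
grads = rng.standard_t(df=3, size=(512, 1000))

def hill_alpha(x, xmin):
    """Maximum-likelihood estimate of the power-law exponent for the tail x >= xmin."""
    tail = x[x >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Dimension-wise view: magnitudes of one iteration's gradient across parameters.
dimwise = np.abs(grads[0])
xmin = np.quantile(dimwise, 0.90)   # crude tail cutoff; the paper's tests are more careful
print("estimated tail exponent (dimension-wise):", hill_alpha(dimwise, xmin))

# Covariance spectrum: a power law lambda_k ~ k^(-s) appears as a straight
# line of slope -s in a log-log plot of eigenvalue against rank.
cov = np.cov(grads, rowvar=False)
eig = np.sort(np.linalg.eigvalsh(cov))[::-1]
eig = eig[eig > 1e-12]              # drop numerically zero directions
ranks = np.arange(1, len(eig) + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(eig), 1)
print("log-log slope of the covariance spectrum:", slope)
```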
Related papers
- Limit Theorems for Stochastic Gradient Descent with Infinite Variance [47.87144151929621]
We show that the stochastic gradient descent algorithm can be characterized via the stationary distribution of a suitably defined Ornstein-Uhlenbeck process driven by an appropriate Lévy process.
We also explore the applications of these results to linear regression and logistic regression models.
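For intuition about the limiting object described here, a hedged toy simulation of an Ornstein-Uhlenbeck-type recursion driven by alpha-stable (Lévy) increments is sketched below; the parameter choices (alpha = 1.5, theta = 1) and the Euler discretization are my own and are not taken from the paper.

```python
import numpy as np
from scipy.stats import levy_stable

alpha, theta, dt, n_steps = 1.5, 1.0, 1e-2, 20_000
increments = levy_stable.rvs(alpha, 0.0, scale=dt ** (1.0 / alpha),
                             size=n_steps, random_state=0)

x = np.zeros(n_steps + 1)
for t in range(n_steps):
    # Euler step of dX_t = -theta * X_t dt + dL_t, with L an alpha-stable Levy process
    x[t + 1] = x[t] - theta * x[t] * dt + increments[t]

print("sample 99.9th percentile of |X|:", np.quantile(np.abs(x), 0.999))
```

With alpha < 2 the driving noise has infinite variance, so the simulated path shows occasional large jumps and a heavy-tailed stationary distribution.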
arXiv Detail & Related papers (2024-10-21T09:39:10Z)
- On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities to analyze closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
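A small sketch under one common reading of the unhinged loss, ell(z) = 1 - z for the margin z = y f(x) (following van Rooyen et al., 2015); this reading is an assumption on my part rather than a definition taken from the paper. Linearity in the margin means the per-example gradient does not depend on the current prediction, which is what makes closed-form dynamics tractable.

```python
import torch

def unhinged_loss(logits, targets):
    # targets in {-1, +1}; the loss is linear in the margin y * f(x)
    return (1.0 - targets * logits).mean()

w = torch.zeros(5, requires_grad=True)
X = torch.randn(8, 5)
y = torch.randint(0, 2, (8,)).float() * 2 - 1   # random labels in {-1, +1}

unhinged_loss(X @ w, y).backward()
print(w.grad)   # equals -(1/n) * X^T y, independent of the current w
```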
arXiv Detail & Related papers (2023-12-13T02:11:07Z)
- On Training Implicit Models [75.20173180996501]
We propose a novel gradient estimate for implicit models, named phantom gradient, that forgoes the costly computation of the exact gradient.
Experiments on large-scale tasks demonstrate that these lightweight phantom gradients significantly accelerate the backward passes in training implicit models by roughly 1.7 times.
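A hedged sketch of the general idea (my own illustration; the paper's phantom gradient has its own specific damped and unrolled form): approximate the implicit backward pass with a truncated Neumann series of vector-Jacobian products instead of solving the exact linear system involving (I - df/dz)^{-1}.

```python
import torch

def fixed_point(f, x, z0, n_iters=50):
    """Forward pass: iterate z <- f(z, x) toward a fixed point, with no autograd graph."""
    z = z0
    with torch.no_grad():
        for _ in range(n_iters):
            z = f(z, x)
    return z

def approx_implicit_grad(f, x, z_star, grad_out, k=5):
    """Approximate dL/dx = grad_out^T (I - J_z)^{-1} J_x with a truncated series."""
    z_star = z_star.detach().requires_grad_(True)
    fz = f(z_star, x)
    v, acc = grad_out.clone(), grad_out.clone()
    for _ in range(k):
        # one vector-Jacobian product with J_z = df/dz evaluated at z*
        v = torch.autograd.grad(fz, z_star, grad_outputs=v, retain_graph=True)[0]
        acc = acc + v
    # push the accumulated vector through J_x = df/dx
    return torch.autograd.grad(fz, x, grad_outputs=acc)[0]

# Usage on a tiny contractive map f(z, x) = 0.5 * tanh(z) + x.
f = lambda z, x: 0.5 * torch.tanh(z) + x
x = torch.randn(4, requires_grad=True)
z_star = fixed_point(f, x, torch.zeros_like(x))
print(approx_implicit_grad(f, x, z_star, grad_out=torch.ones_like(x)))
```

The number of terms k trades accuracy for cost; when f is a contraction in z, the approximation approaches the exact implicit gradient as k grows.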
arXiv Detail & Related papers (2021-11-09T14:40:24Z)
- Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity [24.428843425522107]
We study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous-time version, namely stochastic gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
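A hedged toy version of this setting is sketched below; the u*u - v*v parameterization, the initialization scale, and the step size are my assumptions, and plain small-step gradient descent stands in for gradient flow. With a small initialization the recovered predictor tends toward a sparse solution, illustrating the implicit bias the summary refers to.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:3] = 1.0                       # sparse ground truth
y = X @ beta_true

scale, lr = 1e-3, 1e-3                    # small init strengthens the sparsity bias
u = np.full(d, scale)
v = np.full(d, scale)
for _ in range(50_000):
    beta = u * u - v * v                  # diagonal linear network predictor
    g = X.T @ (X @ beta - y) / n          # gradient of the squared loss w.r.t. beta
    u, v = u - lr * 2 * u * g, v + lr * 2 * v * g

print("largest coordinates:", np.argsort(-np.abs(u * u - v * v))[:5])
```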
arXiv Detail & Related papers (2021-06-17T14:16:04Z)
- Gradient Starvation: A Learning Proclivity in Neural Networks [97.02382916372594]
Gradient Starvation arises when cross-entropy loss is minimized by capturing only a subset of features relevant for the task.
This work provides a theoretical explanation for the emergence of such feature imbalance in neural networks.
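A hedged toy demonstration of the described effect (my own synthetic setup, not the paper's experiments): two features are each predictive of the label, but once the dominant feature drives the cross-entropy loss toward zero, the gradient on the weaker feature's weight shrinks and that feature remains under-learned.

```python
import numpy as np
from scipy.special import expit          # numerically stable sigmoid

rng = np.random.default_rng(0)
n = 1000
y = rng.choice([-1.0, 1.0], size=n)
X = np.stack([3.0 * y + 0.1 * rng.standard_normal(n),   # dominant feature
              0.5 * y + 0.1 * rng.standard_normal(n)],  # weaker, also predictive
             axis=1)

w = np.zeros(2)
lr = 0.5
for _ in range(5000):
    margins = y * (X @ w)
    # gradient of the mean logistic loss log(1 + exp(-margin))
    grad = -(X * (y * expit(-margins))[:, None]).mean(axis=0)
    w -= lr * grad

print("weights [dominant, weak]:", w)    # the weak feature's weight stays comparatively small
```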
arXiv Detail & Related papers (2020-11-18T18:52:08Z)
- Reintroducing Straight-Through Estimators as Principled Methods for Stochastic Binary Networks [85.94999581306827]
Training neural networks with binary weights and activations is a challenging problem due to the lack of gradients and difficulty of optimization over discrete weights.
Many successful experimental results have been achieved with empirical straight-through (ST) approaches.
At the same time, ST methods can in fact be derived as estimators in the stochastic binary network (SBN) model with Bernoulli weights.
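For reference, a minimal sketch of the folklore straight-through trick that the summary contrasts with the principled SBN derivation; this is the generic version, not the paper's estimator.

```python
import torch

class STSign(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)                       # binary activation

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # straight-through: pretend d sign(x)/dx = 1, clipped to |x| <= 1
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

x = torch.randn(5, requires_grad=True)
STSign.apply(x).sum().backward()
print(x.grad)   # nonzero wherever |x| <= 1, even though sign() has zero derivative a.e.
```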
arXiv Detail & Related papers (2020-06-11T23:58:18Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
- Coherent Gradients: An Approach to Understanding Generalization in Gradient Descent-based Optimization [15.2292571922932]
We propose an approach to answering the question of why gradient-based training generalizes, based on a hypothesis about the dynamics of gradient descent.
We show that changes to the network parameters during training are biased towards those that (locally) simultaneously benefit many examples.
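One hedged way to quantify this hypothesis (my own metric, not necessarily the one used in the paper): compute per-example gradients and measure how strongly each aligns with the average gradient; high alignment indicates an update direction that simultaneously benefits many examples.

```python
import torch

torch.manual_seed(0)
w = torch.randn(10, requires_grad=True)
X, y = torch.randn(64, 10), torch.randn(64)

per_example = []
for i in range(len(X)):
    loss_i = (X[i] @ w - y[i]) ** 2                 # squared loss on one example
    (g_i,) = torch.autograd.grad(loss_i, w)
    per_example.append(g_i)
G = torch.stack(per_example)                        # (n_examples, n_params)

mean_g = G.mean(dim=0)
cos = torch.nn.functional.cosine_similarity(G, mean_g.expand_as(G), dim=1)
print("mean alignment of per-example gradients with the average:", cos.mean().item())
```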
arXiv Detail & Related papers (2020-02-25T03:59:31Z)