A Study of Gradient Variance in Deep Learning
- URL: http://arxiv.org/abs/2007.04532v1
- Date: Thu, 9 Jul 2020 03:23:10 GMT
- Title: A Study of Gradient Variance in Deep Learning
- Authors: Fartash Faghri, David Duvenaud, David J. Fleet, Jimmy Ba
- Abstract summary: We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
- Score: 56.437755740715396
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impact of gradient noise on training deep models is widely acknowledged
but not well understood. In this context, we study the distribution of
gradients during training. We introduce a method, Gradient Clustering, to
minimize the variance of average mini-batch gradient with stratified sampling.
We prove that the variance of average mini-batch gradient is minimized if the
elements are sampled from a weighted clustering in the gradient space. We
measure the gradient variance on common deep learning benchmarks and observe
that, contrary to common assumptions, gradient variance increases during
training, and smaller learning rates coincide with higher variance. In
addition, we introduce normalized gradient variance as a statistic that better
correlates with the speed of convergence compared to gradient variance.
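To make these measurements concrete, below is a minimal NumPy sketch (not the authors' implementation): it estimates the variance of the average mini-batch gradient from per-example gradients, computes a normalized variant of that statistic, and draws a stratified mini-batch by clustering examples in gradient space. The crude k-means routine, the exact normalization, and all function names are illustrative assumptions rather than the paper's definitions.
```python
import numpy as np

def minibatch_gradient_variance(per_example_grads, batch_size):
    """Estimate the variance (trace of the covariance) of the average gradient
    of a uniformly sampled mini-batch from per-example gradients of shape (n, d)."""
    g_mean = per_example_grads.mean(axis=0)
    per_example_var = ((per_example_grads - g_mean) ** 2).sum(axis=1).mean()
    return per_example_var / batch_size

def normalized_gradient_variance(per_example_grads, batch_size, eps=1e-12):
    """Gradient variance divided by the squared norm of the mean gradient.
    (One plausible normalization; the paper's exact definition may differ.)"""
    g_mean = per_example_grads.mean(axis=0)
    var = minibatch_gradient_variance(per_example_grads, batch_size)
    return var / (np.dot(g_mean, g_mean) + eps)

def stratified_minibatch(per_example_grads, n_clusters, rng, n_iter=10):
    """Stratified sampling sketch: cluster examples in gradient space with a
    crude k-means, then draw one example per cluster.  Weighting each draw by
    its cluster's relative size keeps the weighted average gradient unbiased."""
    n = len(per_example_grads)
    centers = per_example_grads[rng.choice(n, size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((per_example_grads[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            if (assign == k).any():
                centers[k] = per_example_grads[assign == k].mean(axis=0)
    indices, weights = [], []
    for k in range(n_clusters):
        members = np.where(assign == k)[0]
        if len(members):
            indices.append(rng.choice(members))
            weights.append(len(members) / n)
    return np.array(indices), np.array(weights)

# Usage with synthetic per-example "gradients" (in practice these would come
# from the model being trained): 256 examples, 50 parameters.
rng = np.random.default_rng(0)
grads = rng.normal(size=(256, 50))
print(minibatch_gradient_variance(grads, batch_size=32))
print(normalized_gradient_variance(grads, batch_size=32))
idx, w = stratified_minibatch(grads, n_clusters=32, rng=rng)
print(len(idx), w.sum())  # up to 32 stratified indices; weights sum to 1
```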
Related papers
- Pathwise Gradient Variance Reduction with Control Variates in Variational Inference [2.1638817206926855]
Variational inference in Bayesian deep learning often involves computing the gradient of an expectation that lacks a closed-form solution.
In these cases, pathwise and score-function gradient estimators are the most common approaches.
Recent research suggests that even pathwise gradient estimators could benefit from variance reduction (a minimal estimator-comparison sketch appears after this list).
arXiv Detail & Related papers (2024-10-08T07:28:46Z)
- Preferential Subsampling for Stochastic Gradient Langevin Dynamics [3.158346511479111]
Stochastic gradient MCMC offers an unbiased estimate of the gradient of the log-posterior from a small, uniformly-weighted subsample of the data.
The resulting gradient estimator may exhibit high variance and impact sampler performance.
We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used (see the subsampled-gradient sketch after this list).
arXiv Detail & Related papers (2022-10-28T14:56:18Z)
- The Equalization Losses: Gradient-Driven Training for Long-tailed Object Recognition [84.51875325962061]
We propose a gradient-driven training mechanism to tackle the long-tail problem.
We introduce a new family of gradient-driven loss functions, namely equalization losses.
Our method consistently outperforms the baseline models.
arXiv Detail & Related papers (2022-10-11T16:00:36Z)
- Adaptive Perturbation-Based Gradient Estimation for Discrete Latent Variable Models [28.011868604717726]
We present Adaptive IMLE, the first adaptive gradient estimator for complex discrete distributions.
We show that our estimator can produce faithful estimates while requiring orders of magnitude fewer samples than other gradient estimators.
arXiv Detail & Related papers (2022-09-11T13:32:39Z)
- Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method [17.275654092947647]
We introduce GradaGrad, a method in the same family that naturally grows or shrinks the learning rate based on a different accumulation in the denominator.
We show that it obeys a convergence rate similar to that of AdaGrad and demonstrate its non-monotone adaptation capability with experiments (an AdaGrad-style update sketch follows this list).
arXiv Detail & Related papers (2022-06-14T14:55:27Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- On the Origin of Implicit Regularization in Stochastic Gradient Descent [22.802683068658897]
For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full-batch loss function.
We prove that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite.
We verify that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small.
arXiv Detail & Related papers (2021-01-28T18:32:14Z)
- Unbiased Risk Estimators Can Mislead: A Case Study of Learning with Complementary Labels [92.98756432746482]
We study a weakly supervised problem called learning with complementary labels.
We show that the quality of gradient estimation matters more in risk minimization.
We propose a novel surrogate complementary loss (SCL) framework that trades zero bias for reduced variance.
arXiv Detail & Related papers (2020-07-05T04:19:37Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate more stable and better-performing training of deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
- Carathéodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carathéodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z)
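As referenced from the pathwise control-variates entry above, here is a minimal sketch of the two estimator families it mentions, using an assumed toy setting: a Gaussian q(z) = N(mu, sigma^2) and integrand f(z) = z^2, for which the true gradient d/dmu E_q[f(z)] = 2*mu is known. The particular control variate (a constant baseline) is illustrative and is not the construction from that paper.
```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.5, 1.0          # variational parameters of q(z) = N(mu, sigma^2)
f = lambda z: z ** 2          # toy integrand; E_q[f] = mu^2 + sigma^2, d/dmu = 2*mu
n = 100_000
z = rng.normal(mu, sigma, size=n)

# Score-function (REINFORCE) estimator of d/dmu E_q[f(z)]:
#   f(z) * d/dmu log q(z), with d/dmu log N(z; mu, sigma^2) = (z - mu) / sigma^2
score = f(z) * (z - mu) / sigma**2

# Same estimator with a simple control variate: subtract a constant baseline b
# (here just the sample mean of f); the score function has zero mean, so the
# estimator remains essentially unbiased while its variance drops.
b = f(z).mean()
score_cv = (f(z) - b) * (z - mu) / sigma**2

# Pathwise (reparameterization) estimator: z = mu + sigma * eps, so
#   d/dmu f(mu + sigma * eps) = f'(z) = 2 * z
pathwise = 2 * z

for name, est in [("score", score), ("score + control variate", score_cv),
                  ("pathwise", pathwise)]:
    print(f"{name:>24}: mean {est.mean():+.3f}, variance {est.var():.3f}")
# All three means are close to the true gradient 2*mu = 1.0; the variances
# typically decrease from score-function to control-variate to pathwise.
```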
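As referenced from the preferential-subsampling entry above, here is a minimal sketch of stochastic gradient Langevin dynamics with the standard uniformly-weighted subsampled gradient estimate, on an assumed toy Gaussian model; it shows the unbiased-but-noisy estimator that entry describes and does not implement the paper's preferential (non-uniform) subsampling.
```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: data y_i ~ N(theta, 1), prior theta ~ N(0, variance 10).
N = 10_000
theta_true = 2.0
y = rng.normal(theta_true, 1.0, size=N)

def grad_log_post_estimate(theta, batch_idx):
    """Unbiased estimate of the gradient of the log-posterior using a small,
    uniformly-weighted subsample: rescale the mini-batch sum by N / |batch|."""
    grad_prior = -theta / 10.0
    grad_lik = (y[batch_idx] - theta).sum() * (N / len(batch_idx))
    return grad_prior + grad_lik

def sgld(n_steps=5_000, batch_size=100, step=1e-5):
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        idx = rng.choice(N, size=batch_size, replace=False)
        g = grad_log_post_estimate(theta, idx)
        # SGLD update: half a gradient step plus Gaussian noise with variance = step.
        theta += 0.5 * step * g + rng.normal(0.0, np.sqrt(step))
        samples.append(theta)
    return np.array(samples)

samples = sgld()
print("posterior mean estimate:", samples[len(samples) // 2:].mean())
# Should be close to the true posterior mean, roughly y.mean() with this much data.
```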
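As referenced from the GradaGrad entry above, here is a minimal sketch of the plain AdaGrad update whose denominator accumulation that entry alludes to, on an assumed toy quadratic; GradaGrad's own non-monotone accumulation rule is not reproduced here.
```python
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, n_steps=200):
    """Plain AdaGrad: per-coordinate step lr / sqrt(accumulated squared gradients).
    The accumulator only grows, so the effective learning rate only shrinks;
    GradaGrad modifies this denominator so the step size can also grow."""
    x = np.asarray(x0, dtype=float).copy()
    acc = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_fn(x)
        acc += g ** 2                      # accumulation in the denominator
        x -= lr * g / (np.sqrt(acc) + eps)
    return x

# Toy objective: f(x) = 0.5 * x^T diag(1, 100) x, with gradient diag(1, 100) x.
grad_fn = lambda x: np.array([1.0, 100.0]) * x
print(adagrad(grad_fn, x0=[1.0, 1.0]))     # both coordinates approach 0
```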
This list is automatically generated from the titles and abstracts of the papers on this site.