The effective noise of Stochastic Gradient Descent
- URL:
- Date: Mon, 20 Dec 2021 20:46:19 GMT
- Title: The effective noise of Stochastic Gradient Descent
- Authors: Francesca Mignacco, Pierfrancesco Urbani
- Abstract summary: Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology.
We characterize the stochasticity of SGD and a recently-introduced variant, persistent SGD, in a prototypical neural network model.
We find that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
- Score: 9.645196221785694
- License:
- Abstract: Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning
technology. At each step of the training phase, a mini batch of samples is
drawn from the training dataset and the weights of the neural network are
adjusted according to the performance on this specific subset of examples. The
mini-batch sampling procedure introduces a stochastic dynamics to the gradient
descent, with a non-trivial state-dependent noise. We characterize the
stochasticity of SGD and a recently-introduced variant, persistent SGD, in a
prototypical neural network model. In the under-parametrized regime, where the
final training error is positive, the SGD dynamics reaches a stationary state
and we define an effective temperature from the fluctuation-dissipation
theorem, computed from dynamical mean-field theory. We use the effective
temperature to quantify the magnitude of the SGD noise as a function of the
problem parameters. In the over-parametrized regime, where the training error
vanishes, we measure the noise magnitude of SGD by computing the average
distance between two replicas of the system with the same initialization and
two different realizations of SGD noise. We find that the two noise measures
behave similarly as a function of the problem parameters. Moreover, we observe
that noisier algorithms lead to wider decision boundaries of the corresponding
constraint satisfaction problem.
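
The two-replica noise measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' model or their dynamical mean-field computation: the linear classifier, the squared-hinge loss, the random Gaussian data, and all names and parameter values (`sgd_replica`, `replica_distance`, `lr`, `batch_size`, the seeds) are assumptions chosen only for illustration. Two copies start from the same weights and differ only in the random mini-batches they draw; their final squared distance per weight serves as a proxy for the noise magnitude.

```python
import numpy as np


def sgd_replica(X, y, w0, lr, batch_size, steps, seed):
    """Run mini-batch SGD on a squared-hinge loss.

    `seed` controls only the mini-batch sampling, i.e. the SGD noise.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n_samples = X.shape[0]
    for _ in range(steps):
        # Draw a mini-batch without replacement from the training set.
        idx = rng.choice(n_samples, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        margins = yb * (Xb @ w)
        active = margins < 1.0  # only unsatisfied constraints contribute
        # Gradient of 0.5 * mean(max(0, 1 - margin)^2) over the batch.
        grad = -((1.0 - margins[active]) * yb[active]) @ Xb[active] / batch_size
        w -= lr * grad
    return w


def replica_distance(n_features=50, n_samples=200, lr=0.05, batch_size=20,
                     steps=500, data_seed=0, seed_a=1, seed_b=2):
    """Squared distance per weight between two replicas with shared initialization."""
    rng = np.random.default_rng(data_seed)
    X = rng.standard_normal((n_samples, n_features)) / np.sqrt(n_features)
    y = rng.choice([-1.0, 1.0], size=n_samples)  # random labels (assumption)
    w0 = rng.standard_normal(n_features)         # same start for both replicas
    w_a = sgd_replica(X, y, w0, lr, batch_size, steps, seed_a)
    w_b = sgd_replica(X, y, w0, lr, batch_size, steps, seed_b)
    return float(np.sum((w_a - w_b) ** 2) / n_features)


if __name__ == "__main__":
    # Compare the replica distance for a few (arbitrary) batch sizes.
    for b in (5, 20, 100):
        print(b, replica_distance(batch_size=b))
```

Passing the same value for `seed_a` and `seed_b` makes the two replicas identical, which is a quick sanity check that the measured distance comes from the mini-batch sampling alone.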
Related papers
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
- Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variation of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z)
- Stochastic Gradient Langevin Dynamics Based on Quantization with Increasing Resolution [0.0]
We propose an alternative descent learning equation based on quantized optimization for non- objective functions.
We demonstrate the effectiveness of the proposed method on vanilla convolutional neural network (CNN) models and the architecture across various data sets.
arXiv Detail & Related papers (2023-05-30T08:55:59Z)
- Doubly Stochastic Models: Learning with Unbiased Label Noises and Inference Stability [85.1044381834036]
We investigate the implicit regularization effects of label noises under mini-batch sampling settings of gradient descent.
We find that such an implicit regularizer favors convergence points that stabilize model outputs against perturbations of the parameters.
Our work does not assume that SGD is an Ornstein-Uhlenbeck-like process and achieves a more general result, with the convergence of the approximation proved.
arXiv Detail & Related papers (2023-04-01T14:09:07Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have effectively been demonstrated in solving forward and inverse differential equation problems.
PINNs are trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Latent Class-Conditional Noise Model [54.56899309997246]
We introduce a Latent Class-Conditional Noise model (LCCN) to parameterize the noise transition under a Bayesian framework.
We then deduce a dynamic label regression method for LCCN, whose Gibbs sampler allows us to efficiently infer the latent true labels.
Our approach safeguards the stable update of the noise transition, which avoids the previous arbitrary tuning from a mini-batch of samples.
arXiv Detail & Related papers (2023-02-19T15:24:37Z)
- Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis [6.497816402045099]
Two common alternatives to stochastic gradient descent (SGD) with theoretical benefits are random reshuffling (SGDRR) and shuffle-once (SGD-SO).
We study the stationary variances of SGD, SGDRR and SGD-SO, whose leading terms decrease in this order, and obtain simple approximations.
arXiv Detail & Related papers (2022-06-01T17:08:04Z)
- Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis [0.0]
We show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
arXiv Detail & Related papers (2021-06-04T16:34:32Z)
- Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent [3.0079490585515343]
Stochastic gradient descent (SGD) is relatively well understood in the vanishing learning rate regime.
We propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime.
arXiv Detail & Related papers (2020-12-07T12:31:43Z)
- Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)