Stochastic gradient descent with noise of machine learning type. Part I:
Discrete time analysis
- URL: http://arxiv.org/abs/2105.01650v1
- Date: Tue, 4 May 2021 17:52:20 GMT
- Title: Stochastic gradient descent with noise of machine learning type. Part I:
Discrete time analysis
- Authors: Stephan Wojtowytsch
- Abstract summary: Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning.
In this article, we discuss some of the common properties of energy landscapes and the noise encountered in machine learning problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is one of the most popular algorithms in
modern machine learning. The noise encountered in these applications is
different from that in many theoretical analyses of stochastic gradient
algorithms. In this article, we discuss some of the common properties of energy
landscapes and stochastic noise encountered in machine learning problems, and
how they affect SGD-based optimization.
In particular, we show that the learning rate in SGD with machine learning
noise can be chosen to be small, but uniformly positive for all times if the
energy landscape resembles that of overparametrized deep learning problems. If
the objective function satisfies a Lojasiewicz inequality, SGD converges to the
global minimum exponentially fast, and even for functions which may have local
minima, we establish almost sure convergence to the global minimum at an
exponential rate from any finite energy initialization. The assumptions that we
make in this result concern the behavior where the objective function is either
small or large and the nature of the gradient noise, but the energy landscape
is fairly unconstrained on the domain where the objective function takes values
in an intermediate regime.
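As an illustrative aside, not taken from the paper, the claim about a uniformly positive learning rate can be made concrete with a standard descent estimate under two assumptions that are common formalizations of the setting above: a Polyak-Lojasiewicz inequality with minimal value zero, and gradient noise whose second moment scales with the objective value, as mini-batch noise does in overparametrized interpolation problems. The constants $L$, $c$, $\sigma$ and the step size $\eta$ below are illustrative.

```latex
% Sketch (assumed setting, not quoted from the paper):
%   f is L-smooth with inf f = 0,
%   Lojasiewicz/PL inequality:  |\nabla f(x)|^2 >= 2 c f(x),
%   ML-type noise:              E[ |g(x) - \nabla f(x)|^2 ] <= \sigma^2 f(x),
%   SGD step:                   x_{k+1} = x_k - \eta g(x_k),  E[ g(x_k) | x_k ] = \nabla f(x_k).
\begin{align*}
\mathbb{E}\big[f(x_{k+1}) \mid x_k\big]
  &\le f(x_k) - \eta \|\nabla f(x_k)\|^2
     + \tfrac{L\eta^2}{2}\,\mathbb{E}\big[\|g(x_k)\|^2 \mid x_k\big] \\
  &\le f(x_k) - \tfrac{\eta}{2} \|\nabla f(x_k)\|^2
     + \tfrac{L\eta^2\sigma^2}{2}\, f(x_k)
     \qquad (\text{using } \eta \le 1/L) \\
  &\le \Big(1 - \eta c + \tfrac{L\eta^2\sigma^2}{2}\Big) f(x_k).
\end{align*}
```

Because the noise contribution is itself proportional to $f(x_k)$, the factor in the last line stays strictly below one for any fixed step size $\eta \le 1/L$ with $\eta < 2c/(L\sigma^2)$, so the expected objective decays geometrically; with additive noise of fixed variance the same estimate would force a decaying learning rate.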
Related papers
- Characterizing Dynamical Stability of Stochastic Gradient Descent in Overparameterized Learning [0.0]
We characterize global minima that are dynamically stable/unstable for both deterministic and stochastic gradient descent.
In particular, we introduce a characteristic Lyapunov exponent which depends on the local dynamics around a global minimum.
arXiv Detail & Related papers (2024-07-29T17:40:04Z) - A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by a magnitude of $O(\alpha^{-1})$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - On the Theoretical Properties of Noise Correlation in Stochastic
Optimization [6.970991851511823]
We show that fPGD possesses exploration abilities that compare favorably with those of PGD and Anti-PGD.
These results open the field to novel ways to exploit noise for machine learning models.
arXiv Detail & Related papers (2022-09-19T16:32:22Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin dynamics is one of the most fundamental algorithms for solving non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Gradient flows and randomised thresholding: sparse inversion and
classification [0.0]
Sparse inversion and classification problems are ubiquitous in modern data science and imaging.
In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy.
Standard (sub)gradient descent methods have been shown to be inefficient when approaching such problems.
arXiv Detail & Related papers (2022-03-22T09:21:14Z) - Stochastic gradient descent with noise of machine learning type. Part
II: Continuous time analysis [0.0]
We show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
arXiv Detail & Related papers (2021-06-04T16:34:32Z) - Combining resampling and reweighting for faithful stochastic
optimization [1.52292571922932]
When the loss function is a sum of multiple terms, a popular method is stochastic gradient descent.
We show that the difference in the Lipschitz constants of the multiple terms in the loss function causes stochastic gradient descent to exhibit different variances at different minima.
arXiv Detail & Related papers (2021-05-31T04:21:25Z) - Why Do Local Methods Solve Nonconvex Problems? [54.284687261929115]
Non-convex optimization is ubiquitous in modern machine learning.
We hypothesize a unified explanation for why local methods nevertheless solve such problems and rigorously formalize it for concrete instances of machine learning problems.
arXiv Detail & Related papers (2021-03-24T19:34:11Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in stochastic gradient descent provides a crucial implicit regularization effect for training overparameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not (a toy sketch contrasting these two noise regimes follows this list).
arXiv Detail & Related papers (2020-06-15T18:31:02Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of Optimistic Adagrad (OAdagrad) for non-convex non-concave min-max problems.
Our experiments show that the benefit of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
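As a companion to the 'Shape Matters' entry above and to the machine-learning-type noise studied in the main paper, the following hypothetical toy sketch, not taken from any of the listed papers, runs SGD on an overparametrized least-squares problem twice: once with genuine mini-batch noise, whose variance vanishes at an interpolating minimum, and once with injected isotropic Gaussian noise of fixed variance. Problem sizes, step size and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized least squares: more parameters (d) than samples (n),
# so interpolating solutions with zero loss exist.
n, d = 20, 100
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def minibatch_grad(x, batch_size=5):
    # Mini-batch gradient: machine-learning-type noise, whose variance
    # shrinks to zero at any interpolating minimum.
    idx = rng.choice(n, size=batch_size, replace=False)
    Ab, bb = A[idx], b[idx]
    return Ab.T @ (Ab @ x - bb) / batch_size

def gaussian_grad(x, sigma=0.1):
    # Full gradient plus isotropic Gaussian noise of fixed variance,
    # the setting of many classical SGD analyses.
    return A.T @ (A @ x - b) / n + sigma * rng.normal(size=d)

eta, steps = 0.01, 5000   # small but fixed learning rate
x_mb = np.zeros(d)
x_gs = np.zeros(d)
for _ in range(steps):
    x_mb -= eta * minibatch_grad(x_mb)
    x_gs -= eta * gaussian_grad(x_gs)

# Mini-batch SGD drives the loss down by many orders of magnitude at a
# fixed step size; Gaussian-noise SGD stalls at a noise floor set by
# eta and sigma.
print(f"mini-batch noise: loss = {loss(x_mb):.2e}")
print(f"gaussian noise  : loss = {loss(x_gs):.2e}")
```

The contrast is the point of the sketch: with noise that scales with the objective, a small constant learning rate suffices for fast convergence to the global minimum, whereas noise of fixed variance produces a stationary error floor unless the learning rate is decayed.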