Stochastic gradient descent with noise of machine learning type. Part I:
Discrete time analysis
- URL: http://arxiv.org/abs/2105.01650v1
- Date: Tue, 4 May 2021 17:52:20 GMT
- Title: Stochastic gradient descent with noise of machine learning type. Part I:
Discrete time analysis
- Authors: Stephan Wojtowytsch
- Abstract summary: gradient descent (SGD) is one of the most popular algorithms in modern machine learning.
In this article, we discuss some of the common properties of energy landscapes and the noise encountered in machine learning problems.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is one of the most popular algorithms in
modern machine learning. The noise encountered in these applications is
different from that in many theoretical analyses of stochastic gradient
algorithms. In this article, we discuss some of the common properties of energy
landscapes and stochastic noise encountered in machine learning problems, and
how they affect SGD-based optimization.
In particular, we show that the learning rate in SGD with machine learning
noise can be chosen to be small, but uniformly positive for all times if the
energy landscape resembles that of overparametrized deep learning problems. If
the objective function satisfies a Lojasiewicz inequality, SGD converges to the
global minimum exponentially fast, and even for functions which may have local
minima, we establish almost sure convergence to the global minimum at an
exponential rate from any finite energy initialization. The assumptions that we
make in this result concern the behavior where the objective function is either
small or large and the nature of the gradient noise, but the energy landscape
is fairly unconstrained on the domain where the objective function takes values
in an intermediate regime.
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimiax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z) - Convergence of the mini-batch SIHT algorithm [0.0]
The Iterative Hard Thresholding (IHT) algorithm has been considered extensively as an effective deterministic algorithm for solving sparse optimizations.
We show that the sequence generated by the sparse mini-batch SIHT is a supermartingale sequence and converges with probability one.
arXiv Detail & Related papers (2022-09-29T03:47:46Z) - On the Theoretical Properties of Noise Correlation in Stochastic
Optimization [6.970991851511823]
We show that fPGD possesses exploration abilities favorable over PGD and Anti-PGD.
These results open the field to novel ways to exploit noise for machine learning models.
arXiv Detail & Related papers (2022-09-19T16:32:22Z) - Losing momentum in continuous-time stochastic optimisation [62.997667081978825]
momentum-based algorithms have become especially popular in recent years.
In work, we propose and analyse a continuous-time model for gradient descent with momentum.
We show convergence of our system to the global minimiser when reducing momentum over time.
arXiv Detail & Related papers (2022-09-08T10:46:05Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Gradient flows and randomised thresholding: sparse inversion and
classification [0.0]
Sparse inversion and classification problems are ubiquitous in modern data science and imaging.
In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy.
Standard (sub)gradient descent methods have shown to be inefficient when approaching such problems.
arXiv Detail & Related papers (2022-03-22T09:21:14Z) - Stochastic gradient descent with noise of machine learning type. Part
II: Continuous time analysis [0.0]
We show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
arXiv Detail & Related papers (2021-06-04T16:34:32Z) - Combining resampling and reweighting for faithful stochastic
optimization [1.52292571922932]
When the loss function is a sum of multiple terms, a popular method is gradient descent.
We show that the difference in the Lipschitz constants of multiple terms in the loss function causes gradient descent to different variances at different minimums.
arXiv Detail & Related papers (2021-05-31T04:21:25Z) - Why Do Local Methods Solve Nonconvex Problems? [54.284687261929115]
Non-used optimization is ubiquitous in modern machine learning.
We rigorously formalize it for instances of machine learning problems.
We hypothesize a unified explanation for this phenomenon.
arXiv Detail & Related papers (2021-03-24T19:34:11Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training over parameterized models.
We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise.
Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.