Related papers: Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis

URL: http://arxiv.org/abs/2105.01650v1
Date: Tue, 4 May 2021 17:52:20 GMT
Title: Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis
Authors: Stephan Wojtowytsch
Abstract summary: gradient descent (SGD) is one of the most popular algorithms in modern machine learning. In this article, we discuss some of the common properties of energy landscapes and the noise encountered in machine learning problems.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Stochastic gradient descent (SGD) is one of the most popular algorithms in modern machine learning. The noise encountered in these applications is different from that in many theoretical analyses of stochastic gradient algorithms. In this article, we discuss some of the common properties of energy landscapes and stochastic noise encountered in machine learning problems, and how they affect SGD-based optimization. In particular, we show that the learning rate in SGD with machine learning noise can be chosen to be small, but uniformly positive for all times if the energy landscape resembles that of overparametrized deep learning problems. If the objective function satisfies a Lojasiewicz inequality, SGD converges to the global minimum exponentially fast, and even for functions which may have local minima, we establish almost sure convergence to the global minimum at an exponential rate from any finite energy initialization. The assumptions that we make in this result concern the behavior where the objective function is either small or large and the nature of the gradient noise, but the energy landscape is fairly unconstrained on the domain where the objective function takes values in an intermediate regime.

Related papers

Characterizing Dynamical Stability of Stochastic Gradient Descent in Overparameterized Learning [0.0]
We characterize global minima that are dynamically stable/unstable for both deterministic and gradient descent. In particular, we introduce a characteristic Lyapunov exponent which depends on the local dynamics around a global minimum.
arXiv Detail & Related papers (2024-07-29T17:40:04Z)
A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametricized two-layer neural networks. We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks. Results show that the feature representation induced by the neural networks is allowed to deviate from the initial one by the magnitude of $O(alpha-1)$, measured in terms of the Wasserstein distance.
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
Optimal Rates and Saturation for Noiseless Kernel Ridge Regression [4.585021053685196]
We present a comprehensive study of Kernel ridge regression (KRR) in the noiseless regime. KRR is a fundamental method for learning functions from finite samples. We introduce a refined notion of degrees of freedom, which we believe has broader applicability in the analysis of kernel methods.
arXiv Detail & Related papers (2024-02-24T04:57:59Z)
On the Theoretical Properties of Noise Correlation in Stochastic Optimization [6.970991851511823]
We show that fPGD possesses exploration abilities favorable over PGD and Anti-PGD. These results open the field to novel ways to exploit noise for machine learning models.
arXiv Detail & Related papers (2022-09-19T16:32:22Z)
Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems. In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z)
Gradient flows and randomised thresholding: sparse inversion and classification [0.0]
Sparse inversion and classification problems are ubiquitous in modern data science and imaging. In classification, we consider, e.g., the sum of a data fidelity term and a non-smooth Ginzburg--Landau energy. Standard (sub)gradient descent methods have shown to be inefficient when approaching such problems.
arXiv Detail & Related papers (2022-03-22T09:21:14Z)
Stochastic gradient descent with noise of machine learning type. Part II: Continuous time analysis [0.0]
We show that in a certain noise regime, the optimization algorithm prefers 'flat' minima of the objective function in a sense which is different from the flat minimum selection of continuous time SGD with homogeneous noise.
arXiv Detail & Related papers (2021-06-04T16:34:32Z)
Combining resampling and reweighting for faithful stochastic optimization [1.52292571922932]
When the loss function is a sum of multiple terms, a popular method is gradient descent. We show that the difference in the Lipschitz constants of multiple terms in the loss function causes gradient descent to different variances at different minimums.
arXiv Detail & Related papers (2021-05-31T04:21:25Z)
Why Do Local Methods Solve Nonconvex Problems? [54.284687261929115]
Non-used optimization is ubiquitous in modern machine learning. We rigorously formalize it for instances of machine learning problems. We hypothesize a unified explanation for this phenomenon.
arXiv Detail & Related papers (2021-03-24T19:34:11Z)
Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime. We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
Shape Matters: Understanding the Implicit Bias of the Noise Covariance [76.54300276636982]
Noise in gradient descent provides a crucial implicit regularization effect for training over parameterized models. We show that parameter-dependent noise -- induced by mini-batches or label perturbation -- is far more effective than Gaussian noise. Our analysis reveals that parameter-dependent noise introduces a bias towards local minima with smaller noise variance, whereas spherical Gaussian noise does not.
arXiv Detail & Related papers (2020-06-15T18:31:02Z)
Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks. In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems. Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.