Stochastic gradient descent introduces an effective landscape-dependent
regularization favoring flat solutions
- URL: http://arxiv.org/abs/2206.01246v1
- Date: Thu, 2 Jun 2022 18:49:36 GMT
- Title: Stochastic gradient descent introduces an effective landscape-dependent
regularization favoring flat solutions
- Authors: Ning Yang, Chao Tang, Yuhai Tu
- Abstract summary: Generalization is one of the most important problems in deep learning (DL).
There exist many low-loss solutions that fit the training data equally well.
The key question is which solution is more generalizable.
- Score: 5.022507593837554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalization is one of the most important problems in deep learning (DL).
In the overparameterized regime in neural networks, there exist many low-loss
solutions that fit the training data equally well. The key question is which
solution is more generalizable. Empirical studies showed a strong correlation
between flatness of the loss landscape at a solution and its generalizability,
and stochastic gradient descent (SGD) is crucial in finding the flat solutions.
To understand how SGD drives the learning system to flat solutions, we
construct a simple model whose loss landscape has a continuous set of
degenerate (or near degenerate) minima. By solving the Fokker-Planck equation
of the underlying stochastic learning dynamics, we show that due to its strong
anisotropy the SGD noise introduces an additional effective loss term that
decreases with flatness and has an overall strength that increases with the
learning rate and batch-to-batch variation. We find that the additional
landscape-dependent SGD-loss breaks the degeneracy and serves as an effective
regularization for finding flat solutions. Furthermore, a stronger SGD noise
shortens the convergence time to the flat solutions. However, we identify an
upper bound for the SGD noise beyond which the system fails to converge. Our
results not only elucidate the role of SGD in generalization, but may also have
important implications for hyperparameter selection, enabling efficient learning
without divergence.
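The mechanism described in the abstract can be illustrated with a small numerical experiment. The sketch below is not the paper's model or its Fokker-Planck analysis; the toy landscape L(w1, w2) = 0.5 * k(w1) * w2^2, the curvature profile k(w1) = 1 + w1^2, and the curvature-proportional Gaussian noise (a common stand-in for minibatch noise near a minimum) are all illustrative assumptions. Every point on the line w2 = 0 is a zero-loss minimum, but the flatness varies along the valley, and the anisotropic, landscape-dependent noise drives the iterate toward the flattest part.

```python
import numpy as np

rng = np.random.default_rng(0)

def k(w1):
    """Curvature of the steep direction; larger k(w1) means a sharper minimum."""
    return 1.0 + w1**2

def grad(w1, w2):
    """Gradient of L(w1, w2) = 0.5 * k(w1) * w2**2."""
    return np.array([w1 * w2**2, k(w1) * w2])

eta = 0.05          # learning rate
noise_scale = 0.5   # stands in for batch-to-batch variation / small batch size
w = np.array([2.0, 0.3])   # start in a sharp part of the zero-loss valley w2 = 0

for step in range(20000):
    g = grad(*w)
    # Anisotropic, landscape-dependent noise: only the steep (w2) direction is
    # noisy, with a magnitude that grows with the local curvature k(w1).
    xi = np.array([0.0, noise_scale * np.sqrt(k(w[0])) * rng.standard_normal()])
    w = w - eta * (g + xi)

print(f"final w1 = {w[0]:.3f}, local curvature k(w1) = {k(w[0]):.3f}")
# w1 typically ends up near 0, the flattest point of the valley: the noise keeps
# w2 fluctuating, and the resulting average gradient along the valley pushes the
# iterate toward lower-curvature (flatter) regions.
```

In this toy, increasing noise_scale or eta speeds up the drift along the valley, echoing the abstract's observation that stronger SGD noise shortens convergence time, while too large an eta (eta * k(w1) approaching 2) makes the steep direction unstable, echoing the upper bound on the noise beyond which the system fails to converge.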
Related papers
- Non-convergence to global minimizers in data driven supervised deep learning: Adam and stochastic gradient descent optimization provably fail to converge to global minimizers in the training of deep neural networks with ReLU activation [3.6185342807265415]
It remains an open problem of research to explain the success and the limitations of SGD methods in rigorous theoretical terms.
In this work we prove, for a large class of SGD methods, that the considered method does, with high probability, not converge to global minimizers of the optimization problem.
The general non-convergence results of this work apply not only to the plain vanilla standard SGD method but also to a large class of accelerated and adaptive SGD methods.
arXiv Detail & Related papers (2024-10-14T14:11:37Z) - The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization [4.7256945641654164]
Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training.
Recent studies of SGD for canonical quadratic optimization or linear regression show that it generalizes well under suitable high-dimensional settings.
This paper investigates SGD with two components that are essential in practice: an exponentially decaying step size schedule and momentum (see the sketch after this list).
arXiv Detail & Related papers (2024-09-15T14:20:03Z) - An Option-Dependent Analysis of Regret Minimization Algorithms in
Finite-Horizon Semi-Markov Decision Processes [47.037877670620524]
We present an option-dependent upper bound on the regret suffered by regret minimization algorithms in finite-horizon problems.
We illustrate that the performance improvement derives from the planning horizon reduction induced by the temporal abstraction enforced by the hierarchical structure.
arXiv Detail & Related papers (2023-05-10T15:00:05Z) - Implicit Stochastic Gradient Descent for Training Physics-informed
Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can suffer training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs in order to improve the stability of the training process (see the implicit-update sketch after this list).
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Adaptive Self-supervision Algorithms for Physics-informed Neural
Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors (see the resampling sketch after this list).
arXiv Detail & Related papers (2022-07-08T18:17:06Z) - Implicit Regularization or Implicit Conditioning? Exact Risk
Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - AlterSGD: Finding Flat Minima for Continual Learning by Alternative
Training [11.521519687645428]
We propose a simple yet effective optimization method, called AlterSGD, to search for flat minima in the loss landscape.
We prove that such a strategy can encourage the optimization to converge to flat minima.
We verify AlterSGD on a continual learning benchmark for semantic segmentation, and the empirical results show that it significantly mitigates forgetting.
arXiv Detail & Related papers (2021-07-13T01:43:51Z) - The Sobolev Regularization Effect of Stochastic Gradient Descent [8.193914488276468]
We show that flat minima regularize the gradient of the model function, which explains the good performance of flat minima.
We also consider high-order moments of the gradient noise, and show that stochastic gradient descent (SGD) tends to impose constraints on these moments, via a linear analysis of SGD around global minima.
arXiv Detail & Related papers (2021-05-27T21:49:21Z) - Direction Matters: On the Implicit Bias of Stochastic Gradient Descent
with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z) - Detached Error Feedback for Distributed SGD with Random Sparsification [98.98236187442258]
The communication bottleneck has been a critical problem in large-scale deep learning.
We propose a new detached error feedback (DEF) algorithm, which shows better convergence than error feedback for non-convex problems.
We also propose DEFA to accelerate the generalization of DEF, which shows better bounds than DEF (see the error-feedback sketch after this list).
arXiv Detail & Related papers (2020-04-11T03:50:59Z)
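For the entry above on the optimality of (accelerated) SGD for high-dimensional quadratic optimization, the following is a minimal sketch of the two components it mentions, an exponentially decaying step size schedule and heavy-ball momentum, applied to a synthetic linear-regression problem. The data, decay rate, momentum coefficient, and batch size are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 50
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)    # synthetic linear regression

w = np.zeros(d)
v = np.zeros(d)                                  # momentum buffer
eta0, decay, beta, batch = 0.05, 0.999, 0.9, 32  # illustrative hyperparameters

for t in range(5000):
    idx = rng.integers(0, n, size=batch)             # sample a minibatch
    g = X[idx].T @ (X[idx] @ w - y[idx]) / batch     # stochastic gradient
    eta = eta0 * decay**t                            # exponentially decaying step size
    v = beta * v + g                                 # heavy-ball momentum
    w = w - eta * v

print("excess risk:", np.mean((X @ (w - w_true)) ** 2))
```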
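For the entry on implicit stochastic gradient descent (ISGD) for training physics-informed neural networks, here is a generic sketch of an implicit (proximal) SGD step on a toy least-squares problem rather than a PINN: each update approximately solves theta_{t+1} = argmin_u L_batch(u) + ||u - theta_t||^2 / (2 * eta) with a few inner gradient iterations. The inner solver, the toy objective, and all hyperparameters are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.01 * rng.standard_normal(100)   # toy least-squares problem

def batch_grad(theta, idx):
    return A[idx].T @ (A[idx] @ theta - b[idx]) / len(idx)

def implicit_sgd_step(theta, idx, eta, inner_steps=20, inner_lr=0.05):
    """Approximately solve argmin_u L_batch(u) + ||u - theta||^2 / (2 * eta)."""
    u = theta.copy()
    for _ in range(inner_steps):
        g = batch_grad(u, idx) + (u - theta) / eta   # gradient of the proximal objective
        u = u - inner_lr * g
    return u

theta = np.zeros(10)
for t in range(500):
    idx = rng.integers(0, 100, size=16)
    theta = implicit_sgd_step(theta, idx, eta=0.5)

print("distance to the true parameters:", np.linalg.norm(theta - x_true))
```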
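For the entry on adaptive self-supervision for PINNs, the sketch below shows the general idea of an adaptive collocation scheme: candidate points are re-weighted by the current residual error, so regions where the model is doing worse receive more collocation points. The residual function and the sampling rule are stand-ins, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)

def residual(x):
    """Stand-in for the magnitude of the PDE residual of the current model on [0, 1];
    here the error is artificially concentrated near x = 0.8."""
    return np.exp(-200.0 * (x - 0.8) ** 2) + 0.05

def resample_collocation(n_points, n_candidates=10_000):
    """Draw collocation points with probability proportional to the residual."""
    candidates = rng.uniform(0.0, 1.0, size=n_candidates)
    weights = residual(candidates)
    idx = rng.choice(n_candidates, size=n_points, replace=False, p=weights / weights.sum())
    return candidates[idx]

# A (hypothetical) training loop would refresh the collocation set like this at
# each outer iteration, so high-error regions receive more supervision.
points = resample_collocation(n_points=256)
print("fraction of points near the high-error region:",
      np.mean(np.abs(points - 0.8) < 0.1))
```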
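For the entry on detached error feedback (DEF) for distributed SGD with random sparsification, the sketch below shows the standard error-feedback mechanism with a random-k sparsifier that this line of work builds on, reduced to a single worker and a toy quadratic objective. The detached DEF/DEFA modifications themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_k_sparsify(v, k):
    """Keep k randomly chosen coordinates and zero out the rest (a biased compressor)."""
    mask = np.zeros_like(v)
    mask[rng.choice(v.size, size=k, replace=False)] = 1.0
    return v * mask

def ef_sgd_step(w, grad_fn, error, lr=0.1, k=2):
    g = grad_fn(w)
    corrected = lr * g + error                 # add back previously dropped mass
    update = random_k_sparsify(corrected, k)   # what would actually be communicated
    error = corrected - update                 # remember what was dropped this round
    return w - update, error

# Toy objective: minimize 0.5 * ||w - target||^2 on a single worker.
target = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
grad_fn = lambda w: w - target

w, err = np.zeros(5), np.zeros(5)
for _ in range(300):
    w, err = ef_sgd_step(w, grad_fn, err)
print("distance to optimum:", np.linalg.norm(w - target))
```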
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.