On discretisation drift and smoothness regularisation in neural network
training
- URL: http://arxiv.org/abs/2310.14036v1
- Date: Sat, 21 Oct 2023 15:21:36 GMT
- Title: On discretisation drift and smoothness regularisation in neural network
training
- Authors: Mihaela Claudia Rosca
- Abstract summary: We aim to take steps towards an improved understanding of deep learning with a focus on optimisation and model regularisation.
We start by investigating gradient descent (GD), a discrete-time algorithm at the basis of most popular deep learning optimisation algorithms.
We derive novel continuous-time flows that account for discretisation drift. Unlike the NGF, these new flows can be used to describe learning rate specific behaviours of GD, such as training instabilities observed in supervised learning and two-player games.
We then translate insights from continuous time into mitigation strategies for unstable GD dynamics, by constructing novel learning rate schedules and regularisers that do not require additional hyperparameters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deep learning recipe of casting real-world problems as mathematical
optimisation and tackling the optimisation by training deep neural networks
using gradient-based optimisation has undoubtedly proven to be a fruitful one.
The understanding behind why deep learning works, however, has lagged behind
its practical significance. We aim to take steps towards an improved
understanding of deep learning with a focus on optimisation and model
regularisation. We start by investigating gradient descent (GD), a
discrete-time algorithm at the basis of most popular deep learning optimisation
algorithms. Understanding the dynamics of GD has been hindered by the presence
of discretisation drift, the numerical integration error between GD and its
often studied continuous-time counterpart, the negative gradient flow (NGF). To
add to the toolkit available to study GD, we derive novel continuous-time flows
that account for discretisation drift. Unlike the NGF, these new flows can be
used to describe learning rate specific behaviours of GD, such as training
instabilities observed in supervised learning and two-player games. We then
translate insights from continuous time into mitigation strategies for unstable
GD dynamics, by constructing novel learning rate schedules and regularisers
that do not require additional hyperparameters. Like optimisation, smoothness
regularisation is another pillar of deep learning's success with wide use in
supervised learning and generative modelling. Despite their individual
significance, the interactions between smoothness regularisation and
optimisation have yet to be explored. We find that smoothness regularisation
affects optimisation across multiple deep learning domains, and that
incorporating smoothness regularisation in reinforcement learning leads to a
performance boost that can be recovered using adaptations to optimisation
methods.
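To make discretisation drift concrete, the following is a minimal sketch using standard backward error analysis (sometimes called implicit gradient regularisation); it is not the specific flows derived in this work, which go further, in particular for two-player games. With learning rate $h$, one GD step, the NGF it discretises, and the leading-order modified flow that absorbs the per-step $O(h^2)$ integration error are:

  $\theta_{t+1} = \theta_t - h\,\nabla L(\theta_t)$ (gradient descent)
  $\dot{\theta} = -\nabla L(\theta)$ (negative gradient flow)
  $\dot{\theta} = -\nabla\big(L(\theta) + \tfrac{h}{4}\,\|\nabla L(\theta)\|^2\big)$ (leading-order modified flow)

The extra term in the modified flow depends on $h$, which is why drift-aware flows can describe learning-rate-specific behaviour that the NGF cannot. A toy numerical illustration (a hypothetical example, not from the paper) of such a learning-rate-specific instability: on the quadratic loss $L(\theta) = \tfrac{a}{2}\theta^2$, the NGF solution always decays, while GD oscillates and diverges once $h a > 2$.

```python
import math

# Toy quadratic loss L(theta) = 0.5 * a * theta**2, so grad L(theta) = a * theta.
a = 10.0        # curvature
theta0 = 1.0    # initial parameter

def gd(theta, h, steps):
    """Discrete gradient descent: theta <- theta - h * grad L(theta)."""
    for _ in range(steps):
        theta = theta - h * a * theta
    return theta

def ngf(theta, h, steps):
    """Exact negative-gradient-flow solution over the same total time t = h * steps."""
    return theta * math.exp(-a * h * steps)

for h in [0.05, 0.19, 0.21]:   # the GD stability threshold here is 2 / a = 0.2
    print(f"h={h}: GD={gd(theta0, h, 50):.3e}, NGF={ngf(theta0, h, 50):.3e}")
```

For $h = 0.21$ the GD iterates grow without bound while the NGF solution is essentially zero; the gap between the two trajectories is exactly the accumulated discretisation drift.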
Related papers
- Efficient Weight-Space Laplace-Gaussian Filtering and Smoothing for Sequential Deep Learning [29.328769628694484]
Efficiently learning a sequence of related tasks, such as in continual learning, poses a significant challenge for neural nets.
We address this challenge with a grounded framework for sequentially learning related tasks based on Bayesian inference.
arXiv Detail & Related papers (2024-10-09T11:54:33Z)
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with the variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide the applications for fast-rate convergence in games and extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z)
- The Marginal Value of Momentum for Small Learning Rate SGD [20.606430391298815]
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without gradient noise.
Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training where the optimal learning rate is not very large.
arXiv Detail & Related papers (2023-07-27T21:01:26Z)
- Lottery Tickets in Evolutionary Optimization: On Sparse Backpropagation-Free Trainability [0.0]
We study gradient descent (GD)-based sparse training and evolution strategies (ES).
We find that ES explore diverse and flat local optima and do not preserve linear mode connectivity across sparsity levels and independent runs.
arXiv Detail & Related papers (2023-05-31T15:58:54Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been demonstrated to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process; a generic sketch of an implicit update appears after this list.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past direction of change is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
- Reparameterized Variational Divergence Minimization for Stable Imitation [57.06909373038396]
We study the extent to which variations in the choice of probabilistic divergence may yield more performant ILO algorithms.
We contribute a reparameterization trick for adversarial imitation learning to alleviate the challenges of the promising $f$-divergence minimization framework.
Empirically, we demonstrate that our design choices allow for ILO algorithms that outperform baseline approaches and more closely match expert performance in low-dimensional continuous-control tasks.
arXiv Detail & Related papers (2020-06-18T19:04:09Z)
- AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
- Gradient Monitored Reinforcement Learning [0.0]
We focus on the enhancement of training and evaluation performance in reinforcement learning algorithms.
We propose an approach to steer the learning in the weight parameters of a neural network based on the dynamic development and feedback from the training process itself.
arXiv Detail & Related papers (2020-05-25T13:45:47Z)
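As referenced in the ISGD entry above, here is a generic sketch of an implicit (backward-Euler) gradient step, under the usual textbook definition rather than the exact variant used for PINNs in that paper: instead of the explicit update $\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k)$, the implicit update solves $\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_{k+1})$, which for a quadratic loss reduces to a linear solve and remains stable at learning rates where explicit GD diverges.

```python
import numpy as np

# Toy quadratic loss L(theta) = 0.5 * theta @ A @ theta, so grad L(theta) = A @ theta.
# (Hypothetical example; the paper above applies the implicit idea to PINN training.)
A = np.diag([3.0, 50.0])     # ill-conditioned curvature
lr = 0.1                     # explicit GD is unstable here, since lr * 50 > 2

def explicit_step(theta):
    # Forward-Euler / vanilla GD: theta_next = theta - lr * grad L(theta).
    return theta - lr * A @ theta

def implicit_step(theta):
    # Backward-Euler: theta_next = theta - lr * grad L(theta_next).
    # For a quadratic this is a linear solve; in general it is solved approximately.
    return np.linalg.solve(np.eye(2) + lr * A, theta)

theta_e = theta_i = np.array([1.0, 1.0])
for _ in range(20):
    theta_e = explicit_step(theta_e)
    theta_i = implicit_step(theta_i)
print("explicit:", theta_e)  # blows up along the stiff direction
print("implicit:", theta_i)  # decays towards the minimum at zero
```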
This list is automatically generated from the titles and abstracts of the papers on this site.