Losing momentum in continuous-time stochastic optimisation
- URL: http://arxiv.org/abs/2209.03705v1
- Date: Thu, 8 Sep 2022 10:46:05 GMT
- Title: Losing momentum in continuous-time stochastic optimisation
- Authors: Kexin Jin, Jonas Latz, Chenguang Liu, Alessandro Scagliotti
- Abstract summary: Momentum-based stochastic optimisation algorithms have become especially popular in recent years.
In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum.
We show convergence of our system to the global minimiser when reducing momentum over time.
- Score: 62.997667081978825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of deep neural networks and other modern machine learning models
usually consists in solving non-convex optimisation problems that are
high-dimensional and subject to large-scale data. Here, momentum-based
stochastic optimisation algorithms have become especially popular in recent
years. The stochasticity arises from data subsampling which reduces
computational cost. Moreover, both momentum and stochasticity are supposed to
help the algorithm to overcome local minimisers and, hopefully, converge
globally. Theoretically, this combination of stochasticity and momentum is
badly understood.
In this work, we propose and analyse a continuous-time model for stochastic
gradient descent with momentum. This model is a piecewise-deterministic Markov
process that represents the particle movement by an underdamped dynamical
system and the data subsampling through a stochastic switching of the dynamical
system. In our analysis, we investigate long-time limits, the
subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We
are particularly interested in the case of reducing the momentum over time:
intuitively, the momentum helps to overcome local minimisers in the initial
phase of the algorithm, but prohibits fast convergence to a global minimiser
later. Under convexity assumptions, we show convergence of our dynamical system
to the global minimiser when reducing the momentum over time and letting the
subsampling rate go to infinity.
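To make the model concrete, the following is a minimal simulation sketch of a switched, underdamped system of the kind described above: the position follows a velocity, the velocity is damped and driven by the gradient of the currently active data batch, the active batch switches at exponentially distributed times (the switching rate standing in for the subsampling rate), and the friction grows over time so that momentum is gradually lost. The equations, function names, and parameter values here are illustrative assumptions, not the authors' exact formulation.
```python
import numpy as np

def simulate_switched_underdamped(grads, x0, v0, T=10.0, dt=1e-3,
                                  switch_rate=5.0, friction=lambda t: 1.0 + t):
    """Euler simulation of a switched, underdamped system (illustrative sketch).

    dx/dt = v
    dv/dt = -friction(t) * v - grad_{I(t)}(x)

    where I(t) jumps to a uniformly chosen batch index at rate `switch_rate`
    (modelling data subsampling), and friction(t) grows over time so that
    momentum is gradually "lost".  This is an assumed form, not the paper's.
    """
    rng = np.random.default_rng(0)
    x, v, t = np.array(x0, dtype=float), np.array(v0, dtype=float), 0.0
    i = rng.integers(len(grads))                  # currently active batch
    next_switch = rng.exponential(1.0 / switch_rate)
    while t < T:
        if t >= next_switch:                      # stochastic switching of the dynamics
            i = rng.integers(len(grads))
            next_switch += rng.exponential(1.0 / switch_rate)
        x = x + dt * v                            # deterministic flow between switches
        v = v - dt * (friction(t) * v + grads[i](x))
        t += dt
    return x

# Toy usage: two "subsampled" quadratic losses whose average has minimiser 0.5.
grads = [lambda x: x - 1.0, lambda x: x - 0.0]
print(simulate_switched_underdamped(grads, x0=5.0, v0=0.0))
```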
We then propose a stable, symplectic discretisation scheme to construct an
algorithm from our continuous-time dynamical system. In numerical experiments,
we study our discretisation scheme in convex and non-convex test problems.
Additionally, we train a convolutional neural network to solve the CIFAR-10
image classification problem. Here, our algorithm reaches competitive results
compared to stochastic gradient descent with momentum.
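For a feel of how such an algorithm could look in practice, here is a hedged sketch of a semi-implicit (symplectic) Euler update with a momentum coefficient that decays over iterations: the velocity is updated first with the current subsampled gradient, and the position is then advanced with the updated velocity. This is a generic sketch under those assumptions, not the paper's actual discretisation scheme or hyperparameters.
```python
import numpy as np

def symplectic_momentum_sgd(grad_batches, x0, steps=2000, lr=0.05,
                            momentum0=0.9, decay=1e-3, rng=None):
    """Semi-implicit (symplectic) Euler sketch of SGD with decaying momentum.

    Per step: draw a random batch gradient, update the velocity with that
    gradient, then update the position with the *new* velocity.  The momentum
    coefficient shrinks over time, so early iterations explore while late
    iterations behave more like plain SGD near the minimiser.  Hypothetical
    names and schedules; not the authors' exact algorithm.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(steps):
        g = grad_batches[rng.integers(len(grad_batches))](x)   # subsampled gradient
        beta = momentum0 / (1.0 + decay * k)                    # momentum "lost" over time
        v = beta * v - lr * g          # velocity update (uses current gradient)
        x = x + v                      # position update (uses updated velocity)
    return x

# Usage on the same toy subsampled quadratics as above (hypothetical example).
grad_batches = [lambda x: x - 1.0, lambda x: x - 0.0]
print(symplectic_momentum_sgd(grad_batches, x0=np.array([5.0])))
```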
Related papers
- Optimistic Active Exploration of Dynamical Systems [52.91573056896633]
We develop an algorithm for active exploration called OPAX.
We show how OPAX can be reduced to an optimal control problem that can be solved at each episode.
Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
arXiv Detail & Related papers (2023-06-21T16:26:59Z)
- Last-iterate convergence analysis of stochastic momentum methods for neural networks [3.57214198937538]
The momentum method is widely used to solve large-scale optimization problems in neural networks.
Existing convergence results for momentum methods hold only under artificial settings.
The momentum factors can be fixed to a constant, rather than time-varying as in existing analyses.
arXiv Detail & Related papers (2022-05-30T02:17:44Z)
- Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence [69.65563161962245]
We consider minimising a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
arXiv Detail & Related papers (2022-04-20T07:14:21Z)
- Breaking the Convergence Barrier: Optimization via Fixed-Time Convergent Flows [4.817429789586127]
We introduce a Poly-based optimization framework for achieving acceleration, based on the notion of fixed-time stability of dynamical systems.
We validate the accelerated convergence properties of the proposed schemes on a range of numerical examples against state-of-the-art optimization algorithms.
arXiv Detail & Related papers (2021-12-02T16:04:40Z)
- Stochastic Optimization under Distributional Drift [3.0229888038442922]
We provide non-asymptotic convergence guarantees for algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability.
We identify a low drift-to-noise regime in which the tracking efficiency of the gradient method benefits significantly from a step decay schedule.
arXiv Detail & Related papers (2021-08-16T21:57:39Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that heavy-tailed behaviour commonly arises in the parameters due to multiplicative noise from minibatch variance.
A detailed analysis is conducted of key factors, including step size and data, with similar behaviour observed across state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process that can be extended to a continuous-time gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
- Liquid Time-constant Networks [117.57116214802504]
We introduce a new class of time-continuous recurrent neural network models.
Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems.
These neural networks exhibit stable and bounded behavior and yield superior expressivity within the family of neural ordinary differential equations.
arXiv Detail & Related papers (2020-06-08T09:53:35Z)
- CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing [23.87373187143897]
Deep learning applications require global optimization of non-convex objective functions, which have multiple local minima.
We show that the same problem can be resolved in physical simulations using Langevin dynamics with simulated annealing.
arXiv Detail & Related papers (2020-05-29T14:44:24Z)