Losing momentum in continuous-time stochastic optimisation
- URL: http://arxiv.org/abs/2209.03705v2
- Date: Tue, 05 Nov 2024 11:09:15 GMT
- Title: Losing momentum in continuous-time stochastic optimisation
- Authors: Kexin Jin, Jonas Latz, Chenguang Liu, Alessandro Scagliotti
- Abstract summary: Momentum-based stochastic optimisation algorithms have become particularly widespread.
In this work, we analyse a continuous-time model for gradient descent with momentum.
We also train a convolutional neural network in an image classification problem.
- Score: 42.617042045455506
- License:
- Abstract: The training of modern machine learning models often consists in solving high-dimensional non-convex optimisation problems that are subject to large-scale data. In this context, momentum-based stochastic optimisation algorithms have become particularly widespread. The stochasticity arises from data subsampling which reduces computational cost. Both momentum and stochasticity help the algorithm to converge globally. In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum. This model is a piecewise-deterministic Markov process that represents the optimiser by an underdamped dynamical system and the data subsampling through a stochastic switching. We investigate long-time limits, the subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We are particularly interested in the case of reducing the momentum over time. Under convexity assumptions, we show convergence of our dynamical system to the global minimiser when reducing momentum over time and letting the subsampling rate go to infinity. We then propose a stable, symplectic discretisation scheme to construct an algorithm from our continuous-time dynamical system. In experiments, we study our scheme in convex and non-convex test problems. Additionally, we train a convolutional neural network in an image classification problem. Our algorithm attains competitive results compared to stochastic gradient descent with momentum.
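To make the abstract more concrete, here is a minimal, hedged sketch of the kind of underdamped, randomly switching dynamics it describes; the symbols (x, v, gamma, f_i) and the specific schedules are illustrative assumptions, not the paper's exact model.

```latex
% Illustrative sketch (not the paper's exact formulation):
% position x, velocity v, time-dependent friction gamma(t), and a loss f_{i(t)}
% evaluated on the data subsample that is currently active.
\begin{aligned}
  \dot{x}(t) &= v(t),\\
  \dot{v}(t) &= -\gamma(t)\, v(t) - \nabla f_{i(t)}\bigl(x(t)\bigr),
\end{aligned}
```

Here the index process i(t) jumps between data subsamples at an exponential rate (the piecewise-deterministic Markov structure), and an increasing friction gamma(t) models "losing momentum" over time. A semi-implicit Euler discretisation of such a system (symplectic for the frictionless part), with random subsample switching and a growing friction schedule, could then look as follows; every name, default value, and schedule below is an assumption for illustration only, not the authors' algorithm.

```python
import numpy as np

def losing_momentum_sketch(grad_batches, x0, h=0.01, gamma0=0.1, gamma_growth=0.05,
                           switch_rate=10.0, n_steps=2000, rng=None):
    """Hedged sketch: semi-implicit Euler steps for an underdamped system whose
    gradient is evaluated on a randomly switching data subsample and whose
    friction gamma(t) grows over time (so the momentum is gradually lost).
    `grad_batches` is a list of callables, one gradient per data subsample."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)
    i = rng.integers(len(grad_batches))           # currently active subsample
    t = 0.0
    for _ in range(n_steps):
        # Exponential switching of the active subsample at rate `switch_rate`.
        if rng.random() < 1.0 - np.exp(-switch_rate * h):
            i = rng.integers(len(grad_batches))
        gamma = gamma0 + gamma_growth * t         # increasing friction = decaying momentum
        v = v - h * (gamma * v + grad_batches[i](x))   # velocity update (uses old position)
        x = x + h * v                                  # position update (uses new velocity)
        t += h
    return x

# Toy usage: a convex quadratic split into two "subsamples", full loss minimised at 0.
if __name__ == "__main__":
    grads = [lambda x: 2.0 * (x - 1.0), lambda x: 2.0 * (x + 1.0)]
    print(losing_momentum_sketch(grads, x0=np.array([5.0])))
```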
Related papers
- Reduced-Order Neural Operators: Learning Lagrangian Dynamics on Highly Sparse Graphs [20.271792055491662]
We propose to accelerate the simulation of Lagrangian dynamics, such as fluid flows, granular flows, and elastoplasticity, with neural-operator-based reduced-order modeling.
Our framework trains on any spatial discretizations and computes temporal dynamics on any sparse sampling of these discretizations through neural operators.
arXiv Detail & Related papers (2024-07-04T13:37:26Z)
- Optimistic Active Exploration of Dynamical Systems [52.91573056896633]
We develop an algorithm for active exploration called OPAX.
We show how OPAX can be reduced to an optimal control problem that can be solved at each episode.
Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
arXiv Detail & Related papers (2023-06-21T16:26:59Z)
- Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence [69.65563161962245]
We consider minimising a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage; a minimal sketch of Hessian averaging in a Newton iteration appears after this list.
arXiv Detail & Related papers (2022-04-20T07:14:21Z)
- Variational Inference for Continuous-Time Switching Dynamical Systems [29.984955043675157]
We present a model based on a Markov jump process modulating a subordinated diffusion process.
We develop a new continuous-time variational inference algorithm.
We extensively evaluate our algorithm under the model assumption and for real-world examples.
arXiv Detail & Related papers (2021-09-29T15:19:51Z)
- Stochastic Optimization under Distributional Drift [3.0229888038442922]
We provide non-asymptotic convergence guarantees for algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability.
We identify a low drift-to-noise regime in which the tracking efficiency of the gradient method benefits significantly from a step decay schedule.
arXiv Detail & Related papers (2021-08-16T21:57:39Z)
- Reconstructing a dynamical system and forecasting time series by self-consistent deep learning [4.947248396489835]
We introduce a self-consistent deep-learning framework for a noisy deterministic time series.
It provides unsupervised filtering, state-space reconstruction, identification of the underlying differential equations and forecasting.
arXiv Detail & Related papers (2021-08-04T06:10:58Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only a single data point at each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the role of stochasticity in its success is still unclear.
We show that multiplicative noise commonly arises in the parameters of discrete stochastic optimization algorithms due to variance.
A detailed analysis describes the key factors, including step size and data, and shows that state-of-the-art neural network models exhibit similar behaviour.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process which can be extended to a continuous-time gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
- Liquid Time-constant Networks [117.57116214802504]
We introduce a new class of time-continuous recurrent neural network models.
Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems.
These neural networks exhibit stable and bounded behavior and yield superior expressivity within the family of neural ordinary differential equations.
arXiv Detail & Related papers (2020-06-08T09:53:35Z)
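For the Hessian-averaging entry above, here is a minimal, hedged sketch of the general idea: a stochastic Newton step that uses a running average of sampled Hessians. The function names, the uniform weights, and the toy noise model are illustrative assumptions, not the paper's scheme (which analyses general weighted averaging).

```python
import numpy as np

def averaged_newton(grad, hess_sample, x0, n_steps=50, rng=None):
    """Hedged sketch of Hessian averaging in a stochastic Newton method.

    grad(x)             -- gradient of a smooth, strongly convex objective
    hess_sample(x, rng) -- a noisy, symmetric positive-definite Hessian estimate
    The uniform running average below is an illustrative choice; the paper
    studies more general weighted averaging schemes.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    H_avg = np.zeros((x.size, x.size))
    for k in range(1, n_steps + 1):
        H_k = hess_sample(x, rng)
        H_avg += (H_k - H_avg) / k            # running (uniform) average of sampled Hessians
        x -= np.linalg.solve(H_avg, grad(x))  # Newton step with the averaged Hessian
    return x

# Toy usage: quadratic objective 0.5 * x^T A x with noisy Hessian estimates.
if __name__ == "__main__":
    A = np.diag([1.0, 10.0])
    grad = lambda x: A @ x
    hess_sample = lambda x, rng: A + 0.1 * np.diag(rng.standard_normal(2))
    print(averaged_newton(grad, hess_sample, x0=np.array([3.0, -2.0])))
```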