Losing momentum in continuous-time stochastic optimisation
- URL: http://arxiv.org/abs/2209.03705v1
- Date: Thu, 8 Sep 2022 10:46:05 GMT
- Title: Losing momentum in continuous-time stochastic optimisation
- Authors: Kexin Jin, Jonas Latz, Chenguang Liu, Alessandro Scagliotti
- Abstract summary: Momentum-based stochastic optimisation algorithms have become especially popular in recent years.
In this work, we propose and analyse a continuous-time model for stochastic gradient descent with momentum.
We show convergence of our system to the global minimiser when reducing momentum over time.
- Score: 62.997667081978825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training of deep neural networks and other modern machine learning models
usually consists in solving non-convex optimisation problems that are
high-dimensional and subject to large-scale data. Here, momentum-based
stochastic optimisation algorithms have become especially popular in recent
years. The stochasticity arises from data subsampling which reduces
computational cost. Moreover, both momentum and stochasticity are supposed to
help the algorithm to overcome local minimisers and, hopefully, converge
globally. Theoretically, this combination of stochasticity and momentum is
badly understood.
In this work, we propose and analyse a continuous-time model for stochastic
gradient descent with momentum. This model is a piecewise-deterministic Markov
process that represents the particle movement by an underdamped dynamical
system and the data subsampling through a stochastic switching of the dynamical
system. In our analysis, we investigate long-time limits, the
subsampling-to-no-subsampling limit, and the momentum-to-no-momentum limit. We
are particularly interested in the case of reducing the momentum over time:
intuitively, the momentum helps to overcome local minimisers in the initial
phase of the algorithm, but prohibits fast convergence to a global minimiser
later. Under convexity assumptions, we show convergence of our dynamical system
to the global minimiser when reducing the momentum over time and letting the
subsampling rate go to infinity.
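To make the model concrete, the following is a minimal simulation sketch of a switched, underdamped system of the kind described above: the position follows a velocity, the velocity is damped and driven by the gradient of the currently active data batch, the active batch switches at exponentially distributed times (the switching rate standing in for the subsampling rate), and the friction grows over time so that momentum is gradually lost. The equations, function names, and parameter values here are illustrative assumptions, not the authors' exact formulation.
```python
import numpy as np

def simulate_switched_underdamped(grads, x0, v0, T=10.0, dt=1e-3,
                                  switch_rate=5.0, friction=lambda t: 1.0 + t):
    """Euler simulation of a switched, underdamped system (illustrative sketch).

    dx/dt = v
    dv/dt = -friction(t) * v - grad_{I(t)}(x)

    where I(t) jumps to a uniformly chosen batch index at rate `switch_rate`
    (modelling data subsampling), and friction(t) grows over time so that
    momentum is gradually "lost".  This is an assumed form, not the paper's.
    """
    rng = np.random.default_rng(0)
    x, v, t = np.array(x0, dtype=float), np.array(v0, dtype=float), 0.0
    i = rng.integers(len(grads))                  # currently active batch
    next_switch = rng.exponential(1.0 / switch_rate)
    while t < T:
        if t >= next_switch:                      # stochastic switching of the dynamics
            i = rng.integers(len(grads))
            next_switch += rng.exponential(1.0 / switch_rate)
        x = x + dt * v                            # deterministic flow between switches
        v = v - dt * (friction(t) * v + grads[i](x))
        t += dt
    return x

# Toy usage: two "subsampled" quadratic losses whose average has minimiser 0.5.
grads = [lambda x: x - 1.0, lambda x: x - 0.0]
print(simulate_switched_underdamped(grads, x0=5.0, v0=0.0))
```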
We then propose a stable, symplectic discretisation scheme to construct an
algorithm from our continuous-time dynamical system. In numerical experiments,
we study our discretisation scheme in convex and non-convex test problems.
Additionally, we train a convolutional neural network to solve the CIFAR-10
image classification problem. Here, our algorithm reaches competitive results
compared to stochastic gradient descent with momentum.
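For a feel of how such an algorithm could look in practice, here is a hedged sketch of a semi-implicit (symplectic) Euler update with a momentum coefficient that decays over iterations: the velocity is updated first with the current subsampled gradient, and the position is then advanced with the updated velocity. This is a generic sketch under those assumptions, not the paper's actual discretisation scheme or hyperparameters.
```python
import numpy as np

def symplectic_momentum_sgd(grad_batches, x0, steps=2000, lr=0.05,
                            momentum0=0.9, decay=1e-3, rng=None):
    """Semi-implicit (symplectic) Euler sketch of SGD with decaying momentum.

    Per step: draw a random batch gradient, update the velocity with that
    gradient, then update the position with the *new* velocity.  The momentum
    coefficient shrinks over time, so early iterations explore while late
    iterations behave more like plain SGD near the minimiser.  Hypothetical
    names and schedules; not the authors' exact algorithm.
    """
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for k in range(steps):
        g = grad_batches[rng.integers(len(grad_batches))](x)   # subsampled gradient
        beta = momentum0 / (1.0 + decay * k)                    # momentum "lost" over time
        v = beta * v - lr * g          # velocity update (uses current gradient)
        x = x + v                      # position update (uses updated velocity)
    return x

# Usage on the same toy subsampled quadratics as above (hypothetical example).
grad_batches = [lambda x: x - 1.0, lambda x: x - 0.0]
print(symplectic_momentum_sgd(grad_batches, x0=np.array([5.0])))
```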
Related papers
- Optimistic Active Exploration of Dynamical Systems [52.91573056896633]
We develop an algorithm for active exploration called OPAX.
We show how OPAX can be reduced to an optimal control problem that can be solved at each episode.
Our experiments show that OPAX is not only theoretically sound but also performs well for zero-shot planning on novel downstream tasks.
arXiv Detail & Related papers (2023-06-21T16:26:59Z)
- Last-iterate convergence analysis of stochastic momentum methods for neural networks [3.57214198937538]
The momentum method is widely used to solve large-scale optimization problems in neural networks.
Existing convergence results for momentum methods hold only under artificial settings.
The momentum factors can be fixed to a constant, rather than time-varying as in existing analyses.
arXiv Detail & Related papers (2022-05-30T02:17:44Z)
- Hessian Averaging in Stochastic Newton Methods Achieves Superlinear Convergence [69.65563161962245]
We consider minimising a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local convergence at an optimal stage.
arXiv Detail & Related papers (2022-04-20T07:14:21Z)
- Breaking the Convergence Barrier: Optimization via Fixed-Time Convergent Flows [4.817429789586127]
We introduce a Poly-based optimization framework for achieving acceleration, based on the notion of fixed-time stability of dynamical systems.
We validate the accelerated convergence properties of the proposed schemes on a range of numerical examples against state-of-the-art optimization algorithms.
arXiv Detail & Related papers (2021-12-02T16:04:40Z)
- Stochastic Optimization under Distributional Drift [3.0229888038442922]
We provide non-asymptotic convergence guarantees for algorithms with iterate averaging, focusing on bounds valid both in expectation and with high probability.
We identify a low drift-to-noise regime in which the tracking efficiency of the gradient method benefits significantly from a step decay schedule.
arXiv Detail & Related papers (2021-08-16T21:57:39Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of stochasticity in its success is still unclear.
We show that heavy-tailed behaviour commonly arises in the parameters due to multiplicative noise from minibatch variance.
A detailed analysis is conducted of key factors, including step size and data, with similar behaviour observed across state-of-the-art neural network models.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
- Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification [25.898873960635534]
We analyze in closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture.
We define a prototype process that can be extended to a continuous-time gradient flow.
In the full-batch limit, we recover the standard gradient flow.
arXiv Detail & Related papers (2020-06-10T22:49:41Z)
- Liquid Time-constant Networks [117.57116214802504]
We introduce a new class of time-continuous recurrent neural network models.
Instead of declaring a learning system's dynamics by implicit nonlinearities, we construct networks of linear first-order dynamical systems.
These neural networks exhibit stable and bounded behavior and yield superior expressivity within the family of neural ordinary differential equations.
arXiv Detail & Related papers (2020-06-08T09:53:35Z)
- CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing [23.87373187143897]
Deep learning applications require global optimization of non-convex objective functions, which have multiple local minima.
We show that the same problem can be resolved in physical simulations using Langevin dynamics with simulated annealing.
arXiv Detail & Related papers (2020-05-29T14:44:24Z)