NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer
- URL: http://arxiv.org/abs/2209.14937v2
- Date: Sat, 30 Sep 2023 21:07:16 GMT
- Title: NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer
- Authors: Valentin Leplat, Daniil Merkulov, Aleksandr Katrutsa, Daniel
Bershatsky, Olga Tsymboi, Ivan Oseledets
- Abstract summary: We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
- Score: 45.47667026025716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical machine learning models such as deep neural networks are usually
trained by using Stochastic Gradient Descent-based (SGD) algorithms. The
classical SGD can be interpreted as a discretization of the stochastic gradient
flow. In this paper we propose a novel, robust and accelerated stochastic
optimizer that relies on two key elements: (1) an accelerated Nesterov-like
Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel
type discretization. The convergence and stability of the obtained method,
referred to as NAG-GS, are first studied extensively in the case of the
minimization of a quadratic function. This analysis allows us to come up with
an optimal learning rate in terms of the convergence rate while ensuring the
stability of NAG-GS. This is achieved by the careful analysis of the spectral
radius of the iteration matrix and the covariance matrix at stationarity with
respect to all hyperparameters of our method. Further, we show that NAG- GS is
competitive with state-of-the-art methods such as momentum SGD with weight
decay and AdamW for the training of machine learning models such as the
logistic regression model, the residual networks models on standard computer
vision datasets, Transformers in the frame of the GLUE benchmark and the recent
Vision Transformers.
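The abstract does not reproduce the update rule itself. As a rough illustration of the semi-implicit Gauss-Seidel idea, the sketch below discretizes a standard strongly convex accelerated (Nesterov-like) flow, solving the x-update implicitly and then reusing the fresh x in the v-update. The flow, the step size h, and mu are illustrative assumptions; the actual NAG-GS coefficients and its stochastic term come from the paper's SDE and differ from this sketch.

```python
import numpy as np

def semi_implicit_nag_step(x, v, grad, h=0.1, mu=1.0):
    """One Gauss-Seidel-style step of an accelerated flow (illustrative).

    Continuous-time stand-in for a Nesterov-like dynamics
    (mu-strongly-convex form; NOT the paper's exact SDE):
        x' = s * (v - x)
        v' = s * (x - v) - grad(x) / s,   with s = sqrt(mu)
    Gauss-Seidel structure: the x-update is solved implicitly, and the
    v-update then uses the freshly computed x.
    """
    s = np.sqrt(mu)
    x_new = (x + h * s * v) / (1.0 + h * s)                       # implicit in x
    v_new = (v + h * s * x_new - h * grad(x_new) / s) / (1.0 + h * s)
    return x_new, v_new

# Deterministic sanity check on f(x) = 0.5 * x**2 (mu = 1):
x, v = 1.0, 0.0
for _ in range(300):
    x, v = semi_implicit_nag_step(x, v, grad=lambda z: z)
```

On this quadratic the iterate contracts geometrically for any h > 0; the paper's contribution is choosing the learning rate by analyzing the spectral radius of the corresponding iteration matrix, which this toy check does not attempt.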
Related papers
- Convergence of Implicit Gradient Descent for Training Two-Layer Physics-Informed Neural Networks [3.680127959836384]
implicit gradient descent (IGD) outperforms the common gradient descent (GD) in handling certain multi-scale problems.
We show that IGD converges to a globally optimal solution at a linear convergence rate.
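For context, the generic implicit (backward-Euler) gradient step solves x_next = x - eta * grad(x_next) at each iteration. The sketch below resolves this with a simple fixed-point inner loop; it is not the paper's PINN-specific setup, and eta and the tolerance are illustrative choices.

```python
import numpy as np

def implicit_gd_step(x, grad, eta=0.5, tol=1e-10, max_inner=100):
    """Solve x_next = x - eta * grad(x_next) by fixed-point iteration.

    Backward-Euler discretization of gradient flow; illustrative
    parameters, generic update (not the paper's specific scheme).
    """
    x_next = x  # warm-start the inner solve at the current iterate
    for _ in range(max_inner):
        x_new = x - eta * grad(x_next)
        if np.max(np.abs(x_new - x_next)) < tol:
            return x_new
        x_next = x_new
    return x_next

# On f(x) = 0.5 * x**2 the exact implicit step is x / (1 + eta),
# so from x = 1 with eta = 0.5 the step lands at 2/3:
x = np.array([1.0])
x = implicit_gd_step(x, grad=lambda z: z, eta=0.5)
```

A fixed-point inner solve only contracts when eta times the gradient's Lipschitz constant is below one; practical implicit-GD implementations use Newton or conjugate-gradient inner solvers so that large, stable steps remain affordable.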
arXiv Detail & Related papers (2024-07-03T06:10:41Z) - Hitting the High-Dimensional Notes: An ODE for SGD learning dynamics on
GLMs and multi-index models [10.781866671930857]
We analyze the dynamics of streaming stochastic gradient descent (SGD) in the high-dimensional limit.
We demonstrate a deterministic equivalent of SGD in the form of a system of ordinary differential equations.
In addition to the deterministic equivalent, we introduce an SDE with a simplified diffusion coefficient.
arXiv Detail & Related papers (2023-08-17T13:33:02Z) - Machine learning in and out of equilibrium [58.88325379746631]
Our study uses a Fokker-Planck approach, adapted from statistical physics, to explore these parallels.
We focus in particular on the stationary state of the system in the long-time limit, which in conventional SGD is out of equilibrium.
We propose a new variant of stochastic gradient Langevin dynamics (SGLD) that harnesses without-replacement minibatching.
arXiv Detail & Related papers (2023-06-06T09:12:49Z) - Capturing dynamical correlations using implicit neural representations [85.66456606776552]
We develop an artificial intelligence framework which combines a neural network trained to mimic simulated data from a model Hamiltonian with automatic differentiation to recover unknown parameters from experimental data.
In doing so, we illustrate the ability to build and train a differentiable model only once, which then can be applied in real-time to multi-dimensional scattering data.
arXiv Detail & Related papers (2023-04-08T07:55:36Z) - Utilising the CLT Structure in Stochastic Gradient based Sampling :
Improved Analysis and Faster Algorithms [14.174806471635403]
We consider approximations of sampling algorithms, such as Stochastic Gradient Langevin Dynamics (SGLD) and the Random Batch Method (RBM) for Interacting Particle Dynamics (IPD).
We observe that the noise introduced by the approximation is nearly Gaussian due to the Central Limit Theorem (CLT) while the driving Brownian motion is exactly Gaussian.
We harness this structure to absorb the approximation error inside the diffusion process, and obtain improved convergence guarantees for these algorithms.
arXiv Detail & Related papers (2022-06-08T10:17:40Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic Gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-convex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - On the Double Descent of Random Features Models Trained with SGD [78.0918823643911]
We study properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD).
We derive precise non-asymptotic error bounds of RF regression under both constant and adaptive step-size SGD setting.
We observe the double descent phenomenon both theoretically and empirically.
arXiv Detail & Related papers (2021-10-13T17:47:39Z) - Scalable Variational Gaussian Processes via Harmonic Kernel
Decomposition [54.07797071198249]
We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability.
We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections.
Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
arXiv Detail & Related papers (2021-06-10T18:17:57Z) - Convergence Analysis of Homotopy-SGD for non-convex optimization [43.71213126039448]
We present a first-order algorithm based on a combination of homotopy methods and SGD, called Homotopy-Stochastic Gradient Descent (H-SGD).
Under some assumptions, we conduct a theoretical analysis of the proposed algorithm.
Experimental results show that H-SGD can outperform SGD.
arXiv Detail & Related papers (2020-11-20T09:50:40Z) - A Contour Stochastic Gradient Langevin Dynamics Algorithm for
Simulations of Multi-modal Distributions [17.14287157979558]
We propose an adaptively weighted stochastic gradient Langevin dynamics (SGLD) algorithm for learning in big data statistics.
The proposed algorithm is tested on benchmark datasets including CIFAR100.
arXiv Detail & Related papers (2020-10-19T19:20:47Z)
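Several of the entries above are SGLD variants. For reference, the base update they all modify is the Euler-Maruyama discretization of overdamped Langevin dynamics; the step size and the Gaussian target below are illustrative choices, not taken from any of the listed papers.

```python
import numpy as np

def sgld_step(theta, grad_U, eta, rng):
    """One SGLD step: theta - eta * grad U(theta) + sqrt(2 * eta) * noise.

    grad_U is (an estimate of) the gradient of the potential U, where
    the target density is proportional to exp(-U). The variants listed
    above change how grad_U is estimated or how iterates are weighted.
    """
    noise = rng.standard_normal(size=np.shape(theta))
    return theta - eta * grad_U(theta) + np.sqrt(2.0 * eta) * noise

# Sampling a standard Gaussian, U(theta) = theta**2 / 2:
rng = np.random.default_rng(0)
theta, eta, samples = 0.0, 0.01, []
for _ in range(50_000):
    theta = sgld_step(theta, grad_U=lambda t: t, eta=eta, rng=rng)
    samples.append(theta)
```

After discarding a burn-in, the empirical mean and variance of the chain approach 0 and 1 respectively, up to an O(eta) discretization bias since no Metropolis correction is applied.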
This list is automatically generated from the titles and abstracts of the papers in this site.