Convergence of Online Adaptive and Recurrent Optimization Algorithms
- URL: http://arxiv.org/abs/2005.05645v2
- Date: Fri, 8 Jan 2021 09:11:45 GMT
- Title: Convergence of Online Adaptive and Recurrent Optimization Algorithms
- Authors: Pierre-Yves Massé (CTU), Yann Ollivier (FAIR)
- Abstract summary: We prove local convergence of several notable descent algorithms used in machine learning.
We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We prove local convergence of several notable gradient descent algorithms
used in machine learning, for which standard stochastic gradient descent theory
does not apply directly. This includes, first, online algorithms for recurrent
models and dynamical systems, such as \emph{Real-time recurrent learning}
(RTRL) and its computationally lighter approximations NoBackTrack and UORO;
second, several adaptive algorithms such as RMSProp, online natural gradient,
and Adam with $\beta_2 \to 1$. Despite local convergence being a relatively weak
requirement for a new optimization algorithm, no local analysis was available
for these algorithms, as far as we knew. Analysis of these algorithms does not
immediately follow from standard stochastic gradient (SGD) theory. In fact,
Adam has been proved to lack local convergence in some simple situations
\citep{j.2018on}. For recurrent models, online algorithms modify the parameter
while the model is running, which further complicates the analysis with respect
to simple SGD. Local convergence for these various algorithms results from a
single, more general set of assumptions, in the setup of learning dynamical
systems online. Thus, these results can cover other variants of the algorithms
considered. We adopt an "ergodic" rather than probabilistic viewpoint, working
with empirical time averages instead of probability distributions. This is more
data-agnostic and creates differences with respect to standard SGD theory,
especially for the range of possible learning rates. For instance, with cycling
or per-epoch reshuffling over a finite dataset instead of pure i.i.d.\ sampling
with replacement, empirical averages of gradients converge at rate $1/T$
instead of $1/\sqrt{T}$ (cycling acts as a variance reduction method),
theoretically allowing for larger learning rates than in SGD.
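As a concrete illustration of this last claim, the following minimal numerical sketch (not from the paper; the dataset size, dimension, and the use of one fixed "gradient" vector per sample are illustrative assumptions) compares how fast the running empirical average of per-sample gradients approaches the full-dataset average under i.i.d. sampling with replacement versus deterministic cycling over a finite dataset: the i.i.d. error decays roughly like $1/\sqrt{T}$, while the cycling error vanishes at every completed epoch, i.e. decays like $1/T$.

# Minimal numerical sketch (illustrative, not from the paper): running empirical
# averages of per-sample gradients under i.i.d. sampling vs. cycling.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                          # dataset size and gradient dimension (assumed)
grads = rng.normal(size=(n, d))        # one fixed "gradient" per sample, for simplicity
full_mean = grads.mean(axis=0)         # target of the empirical time averages

def running_average_error(order):
    """Deviation of the running average from the full mean along a sampling order."""
    running = np.zeros(d)
    errors = []
    for t, idx in enumerate(order, start=1):
        running += grads[idx]
        errors.append(np.linalg.norm(running / t - full_mean))
    return errors

T = 10_000
iid_order = rng.integers(0, n, size=T)   # i.i.d. sampling with replacement
cyclic_order = np.arange(T) % n          # deterministic cycling over the dataset

iid_err = running_average_error(iid_order)
cyc_err = running_average_error(cyclic_order)

# Expected behaviour: the i.i.d. error decays roughly like 1/sqrt(T), while the
# cycling error drops to (near) zero at every completed epoch, i.e. like 1/T.
for t in (100, 1_000, 10_000):
    print(f"T={t:6d}   iid error={iid_err[t - 1]:.4f}   cyclic error={cyc_err[t - 1]:.6f}")

Per-epoch reshuffling behaves like cycling at epoch boundaries, which is the variance-reduction effect the abstract invokes to justify larger learning rates than in standard SGD.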
Related papers
- Online Learning Under A Separable Stochastic Approximation Framework [20.26530917721778]
We propose an online learning algorithm for a class of machine learning models under a separable stochastic approximation framework.
We show that the proposed algorithm produces more robust test performance when compared to other popular learning algorithms.
arXiv Detail & Related papers (2023-05-12T13:53:03Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product evaluations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z) - Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point at each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity
to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as the Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z) - Stochastic Proximal Gradient Algorithm with Minibatches. Application to
Large Scale Learning Models [2.384873896423002]
We develop and analyze minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with nonsmooth components.
We provide complexity bounds for constant and variable stepsize iteration policies, obtaining that, for minibatch size $N$, $\epsilon$-suboptimality in expected quadratic distance to the optimal solution is attained after $\mathcal{O}\left(\frac{1}{N\epsilon}\right)$ iterations.
arXiv Detail & Related papers (2020-03-30T10:43:56Z) - Learning Sparse Classifiers: Continuous and Mixed Integer Optimization
Perspectives [10.291482850329892]
Mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized regression problems.
We propose two classes of scalable algorithms: an exact algorithm that can handle $p \approx 50{,}000$ features in a few minutes, and approximate algorithms that can address instances with $p \approx 10^6$.
In addition, we present new estimation error bounds for $\ell_0$-regularized estimators.
arXiv Detail & Related papers (2020-01-17T18:47:02Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of an optimistic adaptive gradient algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that adaptive gradient algorithms outperform their non-adaptive counterparts in GAN training.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)