Statistical Adaptive Stochastic Gradient Methods
- URL: http://arxiv.org/abs/2002.10597v1
- Date: Tue, 25 Feb 2020 00:04:16 GMT
- Title: Statistical Adaptive Stochastic Gradient Methods
- Authors: Pengchuan Zhang, Hunter Lang, Qiang Liu and Lin Xiao
- Abstract summary: We propose a statistical adaptive procedure called SALSA for automatically scheduling the learning rate (step size) in gradient methods.
SALSA first uses a smoothed stochastic line-search procedure to gradually increase the learning rate, then automatically switches to a statistical method to decrease it.
The method for decreasing the learning rate is based on a new statistical test for detecting stationarity when using a constant step size.
- Score: 34.859895010071234
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a statistical adaptive procedure called SALSA for automatically
scheduling the learning rate (step size) in stochastic gradient methods. SALSA
first uses a smoothed stochastic line-search procedure to gradually increase
the learning rate, then automatically switches to a statistical method to
decrease the learning rate. The line search procedure "warms up" the
optimization process, reducing the need for expensive trial and error in
setting an initial learning rate. The method for decreasing the learning rate
is based on a new statistical test for detecting stationarity when using a
constant step size. Unlike in prior work, our test applies to a broad class of
stochastic gradient algorithms without modification. The combined method is
highly robust and autonomous, and it matches the performance of the best
hand-tuned learning rate schedules in our experiments on several deep learning
tasks.
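To make the two phases concrete, here is a minimal sketch of a SALSA-style controller. It is an illustration only: the multiplicative warm-up stands in for the paper's smoothed stochastic line search, the slope check on recent losses is a simplified placeholder for the paper's statistical stationarity test, and the class name `SalsaLikeScheduler` together with all constants are assumptions rather than details from the paper.

```python
import numpy as np

class SalsaLikeScheduler:
    """Two-phase learning-rate controller in the spirit of SALSA (sketch only).

    Phase 1 mimics the warm-up idea with a simple multiplicative ramp
    (standing in for the paper's smoothed stochastic line search).
    Phase 2 holds the rate constant and cuts it whenever a crude statistical
    stationarity check on recent losses fires; the check below is a
    placeholder, not the test proposed in the paper.
    """

    def __init__(self, lr0=1e-3, growth=1.02, warmup_steps=500,
                 window=200, drop_factor=0.5):
        self.lr = lr0
        self.growth = growth              # per-step warm-up multiplier
        self.warmup_steps = warmup_steps  # length of phase 1
        self.window = window              # losses used by the stationarity check
        self.drop_factor = drop_factor    # e.g. halve the rate on detection
        self.step_count = 0
        self.losses = []

    def _looks_stationary(self):
        # Fit a line to the last `window` losses; if the slope is no longer
        # clearly negative, treat the iterates as having reached stationarity.
        y = np.asarray(self.losses[-self.window:])
        x = np.arange(len(y))
        slope, intercept = np.polyfit(x, y, 1)
        resid = y - (slope * x + intercept)
        se = np.sqrt(resid @ resid / (len(y) - 2) / ((x - x.mean()) ** 2).sum())
        return slope >= -2.0 * se         # one-sided check on the slope

    def step(self, loss):
        """Record the latest training loss and return the learning rate to use."""
        self.step_count += 1
        self.losses.append(float(loss))
        if self.step_count <= self.warmup_steps:
            self.lr *= self.growth                          # phase 1: ramp up
        elif len(self.losses) >= self.window and self.step_count % self.window == 0:
            if self._looks_stationary():
                self.lr *= self.drop_factor                 # phase 2: statistical drop
                self.losses.clear()                         # restart the check
        return self.lr
```

In a training loop one would call `lr = scheduler.step(loss.item())` after each minibatch and copy `lr` into the optimizer's parameter groups.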
Related papers
- Adaptive Retention & Correction for Continual Learning [114.5656325514408]
A common problem in continual learning is the classification layer's bias towards the most recent task.
We name our approach Adaptive Retention & Correction (ARC).
ARC achieves average performance increases of 2.7% and 2.6% on the CIFAR-100 and ImageNet-R datasets, respectively.
arXiv Detail & Related papers (2024-05-23T08:43:09Z) - Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior precision matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z) - Learning-Rate-Free Learning by D-Adaptation [18.853820404058983]
D-Adaptation is an approach to automatically setting the learning rate which achieves the optimal rate of convergence for convex Lipschitz functions.
We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems.
arXiv Detail & Related papers (2023-01-18T19:00:50Z) - Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory is now continuous.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Training Aware Sigmoidal Optimizer [2.99368851209995]
Deep neural network loss landscapes present many more saddle points than local minima.
We propose the Training Aware Sigmoidal Optimizer (TASO), which consists of a two-phase automated learning rate schedule.
We compare the proposed approach with commonly used adaptive learning rate methods such as Adam, RMSProp, and Adagrad.
arXiv Detail & Related papers (2021-02-17T12:00:46Z) - Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation [8.340191147575307]
We introduce an original probabilistic model for traces of optimisers, based on latent Gaussian processes and an auto-regressive formulation.
It flexibly adjusts to abrupt changes of behaviour induced by new learning rate values.
It is well suited to several problems: online adaptation of the learning rate for a cold-started run; tuning the schedule for a set of similar tasks; and warm-starting the schedule for a new task.
arXiv Detail & Related papers (2020-06-25T13:18:18Z) - AdaS: Adaptive Scheduling of Stochastic Gradients [50.80697760166045]
We introduce the notions of "knowledge gain" and "mapping condition" and propose a new algorithm called Adaptive Scheduling (AdaS).
Experimentation reveals that, using the derived metrics, AdaS exhibits: (a) faster convergence and superior generalization over existing adaptive learning methods; and (b) lack of dependence on a validation set to determine when to stop training.
arXiv Detail & Related papers (2020-06-11T16:36:31Z) - Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector.
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z) - Automatic, Dynamic, and Nearly Optimal Learning Rate Specification by Local Quadratic Approximation [7.386152866234369]
In deep learning tasks, the learning rate determines the update step size in each iteration.
We propose a novel optimization method based on local quadratic approximation (LQA).
arXiv Detail & Related papers (2020-04-07T10:55:12Z) - Stochastic gradient descent with random learning rate [0.0]
We propose to optimize neural networks with a uniformly-distributed random learning rate.
By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime.
We provide supporting evidence through experiments on both shallow fully-connected and deep convolutional neural networks for image classification on the MNIST and CIFAR10 datasets (a minimal sketch of this random learning rate protocol appears after this list).
arXiv Detail & Related papers (2020-03-15T21:36:46Z)
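As referenced in the last entry above, the random learning rate protocol is simple enough to sketch in a few lines. The function below is a hypothetical reading of that idea: `lr_max`, the uniform range, and the per-step resampling are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np
import torch

def sgd_step_with_random_lr(params, lr_max=0.1, rng=None):
    """One SGD update with a learning rate drawn uniformly from (0, lr_max).

    Hypothetical sketch of a uniformly-distributed random learning rate
    protocol; lr_max and the per-step resampling are illustrative choices.
    """
    if rng is None:
        rng = np.random.default_rng()
    lr = rng.uniform(0.0, lr_max)         # fresh random step size each update
    with torch.no_grad():                 # plain SGD update on each parameter
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-lr)
    return lr
```

It would be called once per minibatch, after `loss.backward()`, in place of `optimizer.step()`.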
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.