Stochastic gradient descent with random learning rate
- URL: http://arxiv.org/abs/2003.06926v4
- Date: Sun, 11 Oct 2020 13:42:20 GMT
- Title: Stochastic gradient descent with random learning rate
- Authors: Daniele Musso
- Abstract summary: We propose to optimize neural networks with a uniformly-distributed random learning rate.
By comparing the random learning rate protocol with cyclic and constant protocols, we suggest that the random choice is generically the best strategy in the small learning rate regime.
We provide supporting evidence through experiments on both shallow, fully-connected and deep, convolutional neural networks for image classification on the MNIST and CIFAR10 datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to optimize neural networks with a uniformly-distributed random
learning rate. The associated stochastic gradient descent algorithm can be
approximated by continuous stochastic equations and analyzed within the
Fokker-Planck formalism. In the small learning rate regime, the training
process is characterized by an effective temperature which depends on the
average learning rate, the mini-batch size and the momentum of the optimization
algorithm. By comparing the random learning rate protocol with cyclic and
constant protocols, we suggest that the random choice is generically the best
strategy in the small learning rate regime, yielding better regularization
without extra computational cost. We provide supporting evidence through
experiments on both shallow, fully-connected and deep, convolutional neural
networks for image classification on the MNIST and CIFAR10 datasets.
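As a concrete illustration of the protocol described above, here is a minimal NumPy sketch (not the authors' code) of SGD with momentum and a uniformly-distributed random learning rate on a toy least-squares problem. The support [0, 2*lr_avg] of the uniform distribution, the hyperparameter values and the toy loss are illustrative assumptions; the abstract only states that the learning rate is drawn uniformly and that the average learning rate, mini-batch size and momentum set the effective temperature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem standing in for a network loss:
# loss(w) = 0.5 * ||X w - y||^2 / n, minimized with mini-batch SGD.
n, d = 1000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def minibatch_grad(w, batch_idx):
    Xb, yb = X[batch_idx], y[batch_idx]
    return Xb.T @ (Xb @ w - yb) / len(batch_idx)

# Hyper-parameters (illustrative values, not taken from the paper).
lr_avg     = 1e-2   # average learning rate; enters the effective temperature
batch_size = 32     # mini-batch size; also enters the effective temperature
momentum   = 0.9    # heavy-ball momentum coefficient
steps      = 2000

w = np.zeros(d)
velocity = np.zeros(d)
for step in range(steps):
    batch_idx = rng.choice(n, size=batch_size, replace=False)
    grad = minibatch_grad(w, batch_idx)
    # Random learning rate protocol: draw lr uniformly in [0, 2*lr_avg],
    # so its mean is lr_avg while extra gradient noise is injected.
    lr = rng.uniform(0.0, 2.0 * lr_avg)
    velocity = momentum * velocity - lr * grad
    w = w + velocity

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

The constant and cyclic protocols the paper compares against correspond to replacing the uniform draw with a fixed value or a periodic schedule, respectively.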
Related papers
- Cyclical Log Annealing as a Learning Rate Scheduler [0.0]
A learning rate scheduler is a set of instructions for varying search stepsizes during model training processes.
This paper introduces a new logarithmic method using harsh restarting of step sizes through stochastic gradient descent; an illustrative sketch of such a schedule follows this entry.
arXiv Detail & Related papers (2024-03-13T14:07:20Z)
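For readers unfamiliar with annealed schedulers, the sketch below shows one way a logarithmic schedule with hard restarts could look. The formula, cycle length and bounds are hypothetical illustrations, not the schedule defined in the cited paper.

```python
import math

def log_annealed_lr(step, lr_max=0.1, lr_min=1e-4, cycle_len=1000):
    """Illustrative logarithmic annealing with hard ("harsh") restarts.

    Hypothetical formula, not taken from the paper: within each cycle the
    step size decays like 1/log, and it jumps back to lr_max at every
    cycle boundary (the hard restart).
    """
    t = step % cycle_len                 # position inside the current cycle
    decay = 1.0 / math.log(math.e + t)   # equals 1 at t = 0, decays slowly
    return lr_min + (lr_max - lr_min) * decay

# Example: print the schedule around a restart boundary.
for s in (0, 500, 999, 1000, 1001):
    print(s, round(log_annealed_lr(s), 5))
```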
- Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right, by which we mean using specific insights from the optimisation and kernel communities, gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
- Online Network Source Optimization with Graph-Kernel MAB [62.6067511147939]
We propose Grab-UCB, a graph-kernel multi-armed bandit algorithm to learn online the optimal source placement in large-scale networks.
We describe the network processes with an adaptive graph dictionary model, which typically leads to sparse spectral representations.
We derive the performance guarantees that depend on network parameters, which further influence the learning curve of the sequential decision strategy.
arXiv Detail & Related papers (2023-07-07T15:03:42Z)
- Low-rank extended Kalman filtering for online learning of neural networks from streaming data [71.97861600347959]
We propose an efficient online approximate Bayesian inference algorithm for estimating the parameters of a nonlinear function from a potentially non-stationary data stream.
The method is based on the extended Kalman filter (EKF), but uses a novel low-rank plus diagonal decomposition of the posterior matrix.
In contrast to methods based on variational inference, our method is fully deterministic, and does not require step-size tuning.
arXiv Detail & Related papers (2023-05-31T03:48:49Z)
- Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce Stochastic UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolled optimizers and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z)
- Accelerated Almost-Sure Convergence Rates for Nonconvex Stochastic Gradient Descent using Stochastic Learning Rates [0.0]
This paper uses almost-sure convergence rate analysis of Stochastic Gradient Descent to solve large-scale optimization problems.
In particular, its learning rate is equipped with multiplicative stochasticity, producing a stochastic learning rate.
arXiv Detail & Related papers (2021-10-25T04:27:35Z)
- Stochastic Learning Rate Optimization in the Stochastic Approximation and Online Learning Settings [0.0]
In this work, multiplicative stochasticity is applied to the learning rate of stochastic optimization algorithms, giving rise to stochastic learning-rate schemes.
Theoretical convergence results of Stochastic Gradient Descent equipped with these novel learning rate schemes are presented.
arXiv Detail & Related papers (2021-10-20T18:10:03Z)
- A Simple and Efficient Stochastic Rounding Method for Training Neural Networks in Low Precision [0.0]
Conventional stochastic rounding (CSR) is widely employed in the training of neural networks (NNs); a sketch of CSR follows this entry.
We introduce an improved stochastic rounding method that is simple and efficient.
The proposed method succeeds in training NNs with 16-bit fixed-point numbers.
arXiv Detail & Related papers (2021-03-24T18:47:03Z)
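For context on the entry above, the following is a minimal sketch of conventional stochastic rounding (CSR) to a fixed-point grid. The grid spacing (1/256) and function name are illustrative choices; the paper's improved rounding method is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x, scale=2**8):
    """Conventional stochastic rounding (CSR) to a fixed-point grid.

    Values are scaled, then rounded up with probability equal to their
    fractional part and down otherwise, so the rounding is unbiased in
    expectation: E[stochastic_round(x)] == x.
    """
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    frac = scaled - floor
    rounded = floor + (rng.random(scaled.shape) < frac)
    return rounded / scale

# Example: averaging many stochastic roundings recovers the original value.
x = 0.123456
samples = stochastic_round(np.full(100_000, x))
print(samples.mean())   # close to 0.123456, unlike deterministic rounding
```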
- Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training [2.9649783577150837]
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory.
We derive analytical expressions for the maximal descent and adaptive training regimens for smooth, non-Newton deep neural networks.
We validate our claims on VGG/ResNet architectures and the ImageNet dataset.
arXiv Detail & Related papers (2020-06-16T11:55:45Z)
- Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training; a generic layer-wise rate-scaling sketch follows this entry.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
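As background for the CLARS entry above, here is a generic sketch of layer-wise adaptive rate scaling in the spirit of LARS-type rules. The function name, trust coefficient and values are illustrative assumptions; CLARS's actual scaling rule is specified in the cited paper and differs from this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def layerwise_scaled_update(weights, grads, base_lr=0.1, trust=1e-3, eps=1e-8):
    """Generic layer-wise adaptive rate scaling (LARS-style sketch).

    Each layer's step is rescaled by the ratio of its weight norm to its
    gradient norm, so layers with small gradients relative to their weights
    still move at a comparable relative rate. Shown only for context; this
    is not the CLARS update rule.
    """
    new_weights = []
    for w, g in zip(weights, grads):
        local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(g) + eps)
        new_weights.append(w - base_lr * local_lr * g)
    return new_weights

# Example with two "layers" of random weights and gradients.
weights = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
grads = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
weights = layerwise_scaled_update(weights, grads)
```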