Second-order step-size tuning of SGD for non-convex optimization
- URL: http://arxiv.org/abs/2103.03570v1
- Date: Fri, 5 Mar 2021 10:01:48 GMT
- Title: Second-order step-size tuning of SGD for non-convex optimization
- Authors: Camille Castera, Jérôme Bolte, Cédric Févotte, Edouard Pauwels
- Abstract summary: In view of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case.
One obtains a new stochastic first-order method (Step-Tuned SGD) which can be seen as a stochastic version of the classical Barzilai-Borwein method.
- Score: 6.021787236982659
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In view of a direct and simple improvement of vanilla SGD, this paper
presents a fine-tuning of its step-sizes in the mini-batch case. For doing so,
one estimates curvature, based on a local quadratic model and using only noisy
gradient approximations. One obtains a new stochastic first-order method
(Step-Tuned SGD) which can be seen as a stochastic version of the classical
Barzilai-Borwein method. Our theoretical results ensure almost sure convergence
to the critical set and we provide convergence rates. Experiments on deep
residual network training illustrate the favorable properties of our approach.
For such networks we observe, during training, both a sudden drop of the loss
and an improvement of test accuracy at medium stages, yielding better results
than SGD, RMSprop, or ADAM.
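As a concrete illustration of the curvature-based step-size idea, here is a minimal sketch, assuming a Barzilai-Borwein-style estimate computed from two noisy gradients evaluated on the same mini-batch. It is a simplified stand-in, not the authors' exact Step-Tuned SGD procedure; the trial step, positive-curvature safeguard, and clipping bounds are assumptions.

```python
import numpy as np

def bb_step_size(x, grad_fn, alpha_trial=1e-3, lo=1e-4, hi=1.0):
    """Barzilai-Borwein-type step size from two noisy gradients taken on the
    SAME mini-batch, so their difference reflects curvature rather than
    mini-batch noise. `grad_fn(x)` returns that mini-batch's gradient at x.
    The safeguard and clipping bounds are illustrative assumptions."""
    g1 = grad_fn(x)
    x_trial = x - alpha_trial * g1                 # small exploratory step
    g2 = grad_fn(x_trial)                          # gradient after the trial step
    s, y = x_trial - x, g2 - g1                    # displacement and gradient change
    curvature = np.dot(s, y) / (np.dot(s, s) + 1e-12)   # local quadratic model
    if curvature <= 0:                             # non-convex region: fall back
        return hi
    return float(np.clip(1.0 / curvature, lo, hi))

# Usage inside a plain mini-batch SGD loop on a least-squares toy problem.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
x = np.zeros(10)
for it in range(100):
    idx = rng.choice(50, size=8, replace=False)    # draw a mini-batch
    grad_fn = lambda z: A[idx].T @ (A[idx] @ z - b[idx]) / len(idx)
    eta = bb_step_size(x, grad_fn)                 # tuned step size for this batch
    x = x - eta * grad_fn(x)
```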
Related papers
- Curvature-Informed SGD via General Purpose Lie-Group Preconditioners [6.760212042305871]
We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information.
Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner.
We demonstrate that Preconditioned SGD (PSGD) outperforms SoTA on Vision, NLP, and RL tasks.
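The summary above does not spell out the Lie-group preconditioners themselves; as a generic illustration of preconditioned SGD only, not the paper's matrix-free or low-rank constructions, here is a minimal sketch with a diagonal preconditioner adapted online. The adaptation rule, constants, and `grad_fn` interface are assumptions.

```python
import numpy as np

def preconditioned_sgd(grad_fn, x0, lr=0.1, beta=0.99, eps=1e-8, steps=200):
    """Toy diagonal preconditioning: each coordinate is rescaled by an online
    estimate of inverse gradient magnitude, a crude stand-in for the
    curvature-informed preconditioners discussed above."""
    x = np.asarray(x0, dtype=float).copy()
    v = np.ones_like(x)                       # running second-moment estimate
    for _ in range(steps):
        g = grad_fn(x)                        # stochastic gradient
        v = beta * v + (1 - beta) * g * g     # update diagonal statistics
        x -= lr * g / np.sqrt(v + eps)        # preconditioned SGD step
    return x
```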
arXiv Detail & Related papers (2024-02-07T03:18:00Z) - Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z) - Online Bootstrap Inference with Nonconvex Stochastic Gradient Descent
Estimator [0.0]
In this paper, we investigate the theoretical properties of stochastic gradient descent (SGD) for statistical inference in the context of nonconvex problems.
We propose two inferential procedures for losses which may contain multiple local minima.
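As a generic illustration of bootstrap-style SGD inference, not necessarily either of the paper's two procedures, here is a minimal multiplier-bootstrap sketch: perturbed SGD chains share each streaming datum but weight its gradient randomly, and interval estimates are read off the spread of their averaged iterates. The interface, weights, and averaging choice are assumptions.

```python
import numpy as np

def bootstrap_sgd_ci(grad_fn, data, x0, lr=0.05, B=30, level=0.95, seed=0):
    """Multiplier bootstrap for SGD: chain 0 is the unperturbed estimator,
    chains 1..B rescale each per-datum gradient by a random weight. The
    quantiles of the B perturbed running averages give a crude interval."""
    rng = np.random.default_rng(seed)
    chains = np.tile(np.asarray(x0, dtype=float), (B + 1, 1))
    avgs = np.zeros_like(chains)
    for t, datum in enumerate(data, start=1):
        for c in range(B + 1):
            w = 1.0 if c == 0 else rng.exponential()   # bootstrap multiplier
            chains[c] -= lr * w * grad_fn(chains[c], datum)
            avgs[c] += (chains[c] - avgs[c]) / t       # running (Polyak) average
    half = (1.0 - level) / 2.0
    return avgs[0], np.quantile(avgs[1:], half, axis=0), np.quantile(avgs[1:], 1.0 - half, axis=0)
```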
arXiv Detail & Related papers (2023-06-03T22:08:10Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
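To make the setting concrete, here is a minimal sketch of stochastic extragradient with constant step size, iterate averaging, and scheduled restarts on a toy bilinear game min_x max_y x^T A y. The hyperparameters, noise model, and restart rule are illustrative assumptions, not the schedule analyzed in the paper.

```python
import numpy as np

def seg_restart_avg(A, steps=2000, eta=0.05, restart_every=200, noise=0.1, seed=0):
    """SEG on min_x max_y x^T A y with noisy gradients: extrapolate, then
    update; keep a running average of iterates and periodically restart the
    iterate at that average. The Nash equilibrium is (0, 0)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x, y = rng.standard_normal(m), rng.standard_normal(n)
    x_avg, y_avg, count = np.zeros(m), np.zeros(n), 0
    for t in range(1, steps + 1):
        gx = A @ y + noise * rng.standard_normal(m)        # noisy grad w.r.t. x
        gy = A.T @ x + noise * rng.standard_normal(n)      # noisy grad w.r.t. y
        x_half, y_half = x - eta * gx, y + eta * gy        # extrapolation step
        gx2 = A @ y_half + noise * rng.standard_normal(m)
        gy2 = A.T @ x_half + noise * rng.standard_normal(n)
        x, y = x - eta * gx2, y + eta * gy2                # update step
        count += 1
        x_avg += (x - x_avg) / count                       # iterate averaging
        y_avg += (y - y_avg) / count
        if t % restart_every == 0 and t < steps:           # scheduled restart
            x, y, count = x_avg.copy(), y_avg.copy(), 0
            x_avg, y_avg = np.zeros(m), np.zeros(n)
    return x_avg, y_avg
```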
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence with the fact that a constant step-size learns faster in time, up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
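As one generic way to trade step-size attenuation against gradient variance, not the adaptive scheme proposed in the paper, here is a minimal sketch that keeps the step size constant and grows the mini-batch geometrically so that gradient noise shrinks over time. The growth rule and the `grad_fn(x, batch_size, rng)` interface are assumptions.

```python
import numpy as np

def sgd_growing_batch(grad_fn, x0, lr=0.1, batch0=8, growth=1.05,
                      batch_max=512, steps=300, seed=0):
    """Constant step size with a geometrically growing mini-batch: the larger
    batch plays the role of an attenuating step size by reducing gradient
    variance as training progresses."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    batch = float(batch0)
    for _ in range(steps):
        b = int(min(round(batch), batch_max))   # current mini-batch size
        g = grad_fn(x, b, rng)                  # stochastic gradient on b samples
        x -= lr * g                             # constant step size
        batch *= growth                         # grow the batch instead of decaying lr
    return x
```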
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z) - Joint Stochastic Approximation and Its Application to Learning Discrete
Latent Variable Models [19.07718284287928]
We show that the difficulty of obtaining reliable gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed.
We propose to directly maximize the target log-likelihood and simultaneously minimize the inclusive divergence between the posterior and the inference model.
The resulting learning algorithm is called joint SA (JSA).
arXiv Detail & Related papers (2020-05-28T13:50:08Z) - Robust Learning Rate Selection for Stochastic Optimization via Splitting
Diagnostic [5.395127324484869]
SplitSGD is a new dynamic learning rate schedule for stochastic optimization.
The method decreases the learning rate for better adaptation to the local geometry of the objective function.
It incurs essentially no additional computational cost compared to standard SGD.
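A rough sketch of a splitting-style diagnostic, under simplified assumptions rather than SplitSGD's actual defaults: after each SGD stage, two short threads run from the same point on independent mini-batches, and the learning rate is decayed when their net progress directions disagree. The stage lengths, threshold, and `grad_fn(x, rng)` interface are assumptions.

```python
import numpy as np

def split_diagnostic_sgd(grad_fn, x0, lr=0.5, stage_len=100, split_len=20,
                         decay=0.5, rounds=10, seed=0):
    """Splitting-style diagnostic: near a stationary point the two threads'
    progress is dominated by noise and tends to point in conflicting
    directions, which triggers a learning-rate decay."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(rounds):
        for _ in range(stage_len):              # ordinary SGD stage
            x -= lr * grad_fn(x, rng)
        start = x.copy()
        xa, xb = start.copy(), start.copy()     # two independent short threads
        for _ in range(split_len):
            xa -= lr * grad_fn(xa, rng)
            xb -= lr * grad_fn(xb, rng)
        if np.dot(xa - start, xb - start) < 0:  # threads disagree: decay the rate
            lr *= decay
        x = 0.5 * (xa + xb)                     # merge threads and continue
    return x, lr
```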
arXiv Detail & Related papers (2019-10-18T19:38:53Z)