Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence
in High Dimensions
- URL: http://arxiv.org/abs/2206.01029v1
- Date: Thu, 2 Jun 2022 13:03:14 GMT
- Title: Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence
in High Dimensions
- Authors: Kiwon Lee, Andrew N. Cheng, Courtney Paquette and Elliot Paquette
- Abstract summary: We show that the dynamics of SGD+M converge to a deterministic discrete Volterra equation as dimension increases.
For batch sizes smaller than the ICR, SGD+M has rates that scale like a multiple of the single batch SGD rate.
- Score: 2.575030923243061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyze the dynamics of large batch stochastic gradient descent with
momentum (SGD+M) on the least squares problem when both the number of samples
and dimensions are large. In this setting, we show that the dynamics of SGD+M
converge to a deterministic discrete Volterra equation as dimension increases,
which we analyze. We identify a stability measurement, the implicit
conditioning ratio (ICR), which regulates the ability of SGD+M to accelerate
the algorithm. When the batch size exceeds this ICR, SGD+M converges linearly
at a rate of $\mathcal{O}(1/\sqrt{\kappa})$, matching optimal full-batch
momentum (in particular performing as well as a full-batch but with a fraction
of the size). For batch sizes smaller than the ICR, in contrast, SGD+M has
rates that scale like a multiple of the single batch SGD rate. We give explicit
choices for the learning rate and momentum parameter in terms of the Hessian
spectra that achieve this performance.
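Worked example: the sketch below (not taken from the paper) runs mini-batch SGD with heavy-ball momentum on a synthetic least-squares problem to illustrate the small-batch versus large-batch regimes described in the abstract. The problem sizes, batch sizes, and the learning-rate/momentum values are illustrative assumptions; the paper derives tuned choices from the Hessian spectrum and compares the batch size against the implicit conditioning ratio (ICR), which is not computed here.
```python
import numpy as np

# Minimal sketch (illustrative, not the paper's construction): mini-batch SGD
# with heavy-ball momentum (SGD+M) on a random least-squares problem
#     min_x  ||A x - b||^2 / 2 .
# The sizes, batch sizes, and (learning rate, momentum) pairs below are
# assumptions chosen only to contrast the two regimes; the paper derives
# tuned values from the Hessian spectrum and compares the batch size to the
# implicit conditioning ratio (ICR), which is not computed here.

rng = np.random.default_rng(0)
n, d = 2000, 500                                  # samples and dimension, both large
A = rng.standard_normal((n, d)) / np.sqrt(n)      # scaled so A^T A has an O(1) spectrum
x_star = rng.standard_normal(d)
b = A @ x_star                                    # noiseless, interpolable targets

def sgd_momentum(batch_size, lr, beta, iters=2000):
    """Heavy-ball SGD+M: x_{k+1} = x_k - lr * g_k + beta * (x_k - x_{k-1})."""
    x = np.zeros(d)
    x_prev = x.copy()
    residuals = []
    for _ in range(iters):
        idx = rng.choice(n, size=batch_size, replace=False)
        # unbiased mini-batch estimate of the full gradient A^T (A x - b)
        g = (n / batch_size) * A[idx].T @ (A[idx] @ x - b[idx])
        x_next = x - lr * g + beta * (x - x_prev)
        x_prev, x = x, x_next
        residuals.append(np.mean((A @ x - b) ** 2))   # mean squared residual on all n samples
    return residuals

# Contrast a small batch with a large one; which regime applies in the paper's
# analysis is governed by the batch size relative to the ICR.
small = sgd_momentum(batch_size=8,   lr=0.01, beta=0.5)
large = sgd_momentum(batch_size=512, lr=0.4,  beta=0.6)
print(f"final mean squared residual, batch size   8: {small[-1]:.3e}")
print(f"final mean squared residual, batch size 512: {large[-1]:.3e}")
```
Because b = A @ x_star is exactly interpolable, the stochastic gradient noise vanishes at the minimizer and both runs drive the residual toward zero; the comparison is about the rate of decay, mirroring the below-ICR and above-ICR regimes described in the abstract.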
Related papers
- Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size [0.6906005491572401]
Using an increasing batch size leads to a faster RSGD convergence rate than using a constant batch size. Increasing the batch size also reduces the stochastic first-order oracle (SFO) complexity of RSGD.
arXiv Detail & Related papers (2025-01-30T06:23:28Z) - Increasing Batch Size Improves Convergence of Stochastic Gradient Descent with Momentum [0.6906005491572401]
Stochastic gradient descent with momentum (SGDM) has been well studied in both theory and practice.
We focus on mini-batch SGDM with constant learning rate and constant momentum weight.
arXiv Detail & Related papers (2025-01-15T15:53:27Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic optimization methods for problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - The Effect of SGD Batch Size on Autoencoder Learning: Sparsity,
Sharpness, and Feature Learning [14.004531386769328]
We investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder.
For any batch size strictly smaller than the number of samples, SGD finds a global minimum which is sparse and nearly orthogonal to its initialization, owing to its inherent randomness.
arXiv Detail & Related papers (2023-08-06T21:54:07Z) - High-dimensional limit theorems for SGD: Effective dynamics and critical
scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z) - On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective [0.0]
We study the statistical properties of the dynamic trajectory of stochastic gradient descent (SGD).
We exploit the continuous formulation of the SDE and the theory of Fokker-Planck equations to develop new results on the escaping phenomenon and its relationship with large batch sizes and sharp minima.
arXiv Detail & Related papers (2021-12-02T05:24:05Z) - Last Iterate Risk Bounds of SGD with Decaying Stepsize for
Overparameterized Linear Regression [122.70478935214128]
Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications.
This paper provides problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize.
arXiv Detail & Related papers (2021-10-12T17:49:54Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models [0.2741266294612776]
We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem.
We show that (small-batch) momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly.
In the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum.
arXiv Detail & Related papers (2021-06-07T15:08:24Z) - SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize
Criticality [15.640534097470923]
We propose a new framework for analyzing the dynamics of stochastic gradient descent (SGD) when both the number of samples and the dimension are large.
Using this new framework, we show that the dynamics of SGD on a least squares problem with random data become deterministic in the large sample and dimensional limit.
arXiv Detail & Related papers (2021-02-08T18:00:13Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic
Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that attenuating step-size is required for exact convergence with the fact that constant step-size learns faster in time up to an error.
Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under the considered step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Momentum Improves Normalized SGD [51.27183254738711]
We show that adding momentum provably removes the need for large batch sizes on non-convex objectives.
We show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining.
arXiv Detail & Related papers (2020-02-09T07:00:54Z)