A view of mini-batch SGD via generating functions: conditions of
convergence, phase transitions, benefit from negative momenta
- URL: http://arxiv.org/abs/2206.11124v1
- Date: Wed, 22 Jun 2022 14:15:35 GMT
- Title: A view of mini-batch SGD via generating functions: conditions of
convergence, phase transitions, benefit from negative momenta
- Authors: Maksim Velikanov, Denis Kuznedelev, Dmitry Yarotsky
- Abstract summary: Mini-batch SGD with momentum is a fundamental algorithm for learning large predictive models.
We develop a new analytic framework to analyze mini-batch SGD for linear models at different momenta and sizes of batches.
- Score: 14.857119814202754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mini-batch SGD with momentum is a fundamental algorithm for learning large
predictive models. In this paper we develop a new analytic framework to analyze
mini-batch SGD for linear models at different momenta and sizes of batches. Our
key idea is to describe the loss value sequence in terms of its generating
function, which can be written in a compact form assuming a diagonal
approximation for the second moments of model weights. By analyzing this
generating function, we deduce various conclusions on the convergence
conditions, phase structure of the model, and optimal learning settings. As a
few examples, we show that 1) the optimization trajectory can generally switch
from the "signal-dominated" to the "noise-dominated" phase, at a time scale
that can be predicted analytically; 2) in the "signal-dominated" (but not the
"noise-dominated") phase it is favorable to choose a large effective learning
rate, however its value must be limited for any finite batch size to avoid
divergence; 3) optimal convergence rate can be achieved at a negative momentum.
We verify our theoretical predictions by extensive experiments with MNIST and
synthetic problems, and find a good quantitative agreement.
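For orientation, the setting in the abstract (mini-batch SGD with momentum on a linear model, including negative momentum values) can be sketched as below. This is a minimal illustration, not the authors' code: the heavy-ball form of the update, the function name `minibatch_sgd_momentum`, the synthetic Gaussian data, and all hyperparameter values are assumptions chosen only to make the example runnable.

```python
import numpy as np


def minibatch_sgd_momentum(X, y, lr, momentum, batch_size, steps, seed=0):
    """Heavy-ball mini-batch SGD on the loss L(w) = 0.5 * mean((X @ w - y)**2).

    Returns the full-dataset loss before each step and after the last one.
    The momentum coefficient may be negative.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)  # model weights
    v = np.zeros(d)  # velocity (momentum buffer)
    losses = []
    for _ in range(steps):
        losses.append(0.5 * np.mean((X @ w - y) ** 2))
        # Sample a mini-batch without replacement within the batch.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size
        # Heavy-ball update: the velocity mixes the previous step and the new gradient.
        v = momentum * v - lr * grad
        w = w + v
    losses.append(0.5 * np.mean((X @ w - y) ** 2))
    return np.array(losses)


# Hypothetical synthetic problem: noiseless linear targets with Gaussian features.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(512, 20))
y = X @ data_rng.normal(size=20)

for beta in (-0.3, 0.0, 0.5):
    losses = minibatch_sgd_momentum(
        X, y, lr=0.05, momentum=beta, batch_size=16, steps=300
    )
    print(f"momentum={beta:+.1f}  final loss={losses[-1]:.3e}")
```

The loop compares one negative, one zero, and one positive momentum at a fixed learning rate and batch size. Which value converges fastest depends on the problem spectrum and the batch size, so this toy run should not be read as reproducing the paper's conclusions.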
Related papers
- Max-affine regression via first-order methods [7.12511675782289]
The max-affine model ubiquitously arises in applications in signal processing and statistics.
We present a non-asymptotic convergence analysis of gradient descent (GD) and mini-batch stochastic gradient descent (SGD) for max-affine regression.
arXiv Detail & Related papers (2023-08-15T23:46:44Z)
- Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models [49.81937966106691]
We develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models.
In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach.
arXiv Detail & Related papers (2023-06-15T16:30:08Z)
- Learning Unnormalized Statistical Models via Compositional Optimization [73.30514599338407]
Noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise.
In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models.
arXiv Detail & Related papers (2023-06-13T01:18:16Z)
- Sharper Analysis for Minibatch Stochastic Proximal Point Methods: Stability, Smoothness, and Deviation [41.082982732100696]
We study a minibatch variant of stochastic proximal point (SPP) methods, namely M-SPP, for solving convex composite risk minimization problems.
We show that M-SPP with minibatch-size $n$ and iteration count $T$ enjoys an in-expectation fast rate of convergence.
In the small-$n$-large-$T$ setting, this result substantially improves the best known results of SPP-type approaches.
arXiv Detail & Related papers (2023-01-09T00:13:34Z)
- Instability and Local Minima in GAN Training with Kernel Discriminators [20.362912591032636]
Generative Adversarial Networks (GANs) are a widely-used tool for generative modeling of complex data.
Despite their empirical success, the training of GANs is not fully understood due to the min-max optimization of the generator and discriminator.
This paper analyzes these joint dynamics when the true samples, as well as the generated samples, are discrete, finite sets, and the discriminator is kernel-based.
arXiv Detail & Related papers (2022-08-21T18:03:06Z)
- High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z)
- KL-Entropy-Regularized RL with a Generative Model is Minimax Optimal [70.15267479220691]
We consider and analyze the sample complexity of model-free reinforcement learning with a generative model.
Our analysis shows that it is nearly minimax-optimal for finding an $\varepsilon$-optimal policy when $\varepsilon$ is sufficiently small.
arXiv Detail & Related papers (2022-05-27T19:39:24Z)
- Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and present variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- SGD for Structured Nonconvex Functions: Learning Rates, Minibatching and Interpolation [17.199023009789308]
Stochastic Gradient Descent (SGD) is being used routinely for optimizing non-convex functions.
In this paper, we show convergence results for SGD in the smooth non-convex setting.
We also provide theoretical guarantees for different step-size conditions.
arXiv Detail & Related papers (2020-06-18T07:05:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.