Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond
- URL: http://arxiv.org/abs/2110.10342v1
- Date: Wed, 20 Oct 2021 02:25:25 GMT
- Title: Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond
- Authors: Chulhee Yun, Shashank Rajput, Suvrit Sra
- Abstract summary: We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
- Score: 63.59034509960994
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In distributed learning, local SGD (also known as federated averaging) and
its simple baseline minibatch SGD are widely studied optimization methods. Most
existing analyses of these methods assume independent and unbiased gradient
estimates obtained via with-replacement sampling. In contrast, we study
shuffling-based variants: minibatch and local Random Reshuffling, which draw
stochastic gradients without replacement and are thus closer to practice. For
smooth functions satisfying the Polyak-{\L}ojasiewicz condition, we obtain
convergence bounds (in the large epoch regime) which show that these
shuffling-based variants converge faster than their with-replacement
counterparts. Moreover, we prove matching lower bounds showing that our
convergence analysis is tight. Finally, we propose an algorithmic modification
called synchronized shuffling that leads to convergence rates faster than our
lower bounds in near-homogeneous settings.
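
To make the sampling distinction concrete, here is a minimal Python sketch (illustrative only, not the authors' code) contrasting with-replacement minibatch SGD with the shuffling-based minibatch Random Reshuffling variant; the toy least-squares objective, step size, and batch size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 32, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)  # toy least-squares data

def grad_i(x, i):
    """Gradient of the i-th component f_i(x) = 0.5 * (A[i] @ x - b[i])**2."""
    return (A[i] @ x - b[i]) * A[i]

def minibatch_sgd_with_replacement(x, lr=0.05, epochs=50, batch=8):
    for _ in range(epochs * (n // batch)):
        idx = rng.integers(0, n, size=batch)  # i.i.d. draws; duplicates allowed
        x = x - lr * np.mean([grad_i(x, i) for i in idx], axis=0)
    return x

def minibatch_rr(x, lr=0.05, epochs=50, batch=8):
    for _ in range(epochs):
        perm = rng.permutation(n)             # reshuffle once per epoch
        for start in range(0, n, batch):
            idx = perm[start:start + batch]   # without replacement within an epoch
            x = x - lr * np.mean([grad_i(x, i) for i in idx], axis=0)
    return x

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
for method in (minibatch_sgd_with_replacement, minibatch_rr):
    print(method.__name__, np.linalg.norm(method(np.zeros(d)) - x_star))
```

Under the same gradient budget, the only difference is how indices are drawn: i.i.d. with replacement versus a fresh permutation each epoch.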
Related papers
- High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise [96.80184504268593]
Gradient clipping is one of the key algorithmic ingredients used to derive good high-probability guarantees.
However, clipping can spoil the convergence of popular methods for composite and distributed optimization.
arXiv Detail & Related papers (2023-10-03T07:49:17Z)
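
As background for the clipping result above, gradient clipping simply rescales a stochastic gradient whose norm exceeds a threshold; the sketch below is a generic illustration, and the threshold value is an arbitrary assumption.

```python
import numpy as np

def clip(g, threshold=1.0):
    """Rescale g so its norm is at most `threshold` (a no-op if already smaller)."""
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)

# A heavy-tailed (Cauchy) stochastic gradient stays bounded after clipping.
g = np.random.default_rng(1).standard_cauchy(size=5)
print(np.linalg.norm(g), np.linalg.norm(clip(g)))
```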
- SGDA with shuffling: faster convergence for nonconvex-P{\L} minimax optimization [18.668531108219415]
We present a theoretical analysis of stochastic gradient descent-ascent (SGDA) with random reshuffling for minimax optimization problems.
We analyze both simultaneous and alternating SGDA with random reshuffling (SGDA-RR) for nonconvex objectives with Polyak-{\L}ojasiewicz (P{\L}) geometry.
Our rates also extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA).
Finally, we present a comprehensive lower bound for two-time-scale GDA, which matches the full-batch rate in the primal-P{\L} setting.
arXiv Detail & Related papers (2022-10-12T08:05:41Z)
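
A minimal sketch of simultaneous SGDA with random reshuffling on a toy strongly-convex-strongly-concave (hence P{\L}-P{\L}) quadratic game; the objective, step size, and epoch count are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
a = rng.uniform(0.5, 1.5, n)  # positive curvatures in x
c = rng.uniform(0.5, 1.5, n)  # positive curvatures in y
b = rng.normal(size=n)        # coupling terms

# f_i(x, y) = a[i]/2 * x**2 + b[i] * x * y - c[i]/2 * y**2; saddle point at (0, 0).
def grads(x, y, i):
    return a[i] * x + b[i] * y, b[i] * x - c[i] * y

def sgda_rr(x, y, lr=0.05, epochs=100):
    """Simultaneous SGDA, drawing components without replacement each epoch."""
    for _ in range(epochs):
        for i in rng.permutation(n):
            gx, gy = grads(x, y, i)
            x, y = x - lr * gx, y + lr * gy  # descent in x, ascent in y
    return x, y

print(sgda_rr(1.0, -1.0))  # drifts toward the saddle point (0, 0)
```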
- Sampling without Replacement Leads to Faster Rates in Finite-Sum Minimax Optimization [12.794526174053134]
We analyze convergence rates of gradient algorithms for smooth finite-sum minimax optimization.
We show that, for many such algorithms, sampling points without replacement leads to faster convergence compared to sampling with replacement.
arXiv Detail & Related papers (2022-06-07T00:37:37Z)
- Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms.
The proposed method, equipped with a stochastic average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
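
The one-sample idea above can be sketched generically: keep a table of the most recently seen component gradients, refresh one entry per iteration, and feed the running average to a Frank-Wolfe step. The l1-ball constraint, step-size schedule, and toy objective below are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 6
A, b = rng.normal(size=(n, d)), rng.normal(size=n)  # toy data

def grad_i(x, i):
    """Gradient of the i-th least-squares component."""
    return (A[i] @ x - b[i]) * A[i]

def lmo_l1(g, radius=1.0):
    """Linear minimization oracle over the l1 ball: a signed vertex."""
    s = np.zeros_like(g)
    j = np.argmax(np.abs(g))
    s[j] = -radius * np.sign(g[j])
    return s

def one_sample_cgm(iters=2000, radius=1.0):
    x = np.zeros(d)
    table = np.array([grad_i(x, i) for i in range(n)])  # stored component gradients
    avg = table.mean(axis=0)
    for t in range(1, iters + 1):
        i = rng.integers(n)            # only one fresh sample per iteration
        g_new = grad_i(x, i)
        avg += (g_new - table[i]) / n  # SAG-style running-average update
        table[i] = g_new
        x += 2.0 / (t + 2) * (lmo_l1(avg, radius) - x)  # Frank-Wolfe step
    return x

print(one_sample_cgm())
```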
- Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Streaming Data [0.0]
The streaming framework is analogous to solving optimization problems using time-varying mini-batches that arrive sequentially.
We provide non-asymptotic convergence rates of various gradient-based algorithms.
We show how to accelerate convergence by choosing the learning rate according to the time-varying mini-batches.
arXiv Detail & Related papers (2021-09-15T06:58:23Z)
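
A toy sketch of streaming SGD in this spirit, where the step size adapts to the size of each arriving mini-batch; the specific square-root scaling is an assumed schedule for illustration, not the paper's prescription.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x_true = rng.normal(size=d)

def stream(num_blocks=200):
    """Yield mini-batches of time-varying size, mimicking a data stream."""
    for t in range(1, num_blocks + 1):
        size = 1 + (t % 10)  # arbitrary time-varying batch size
        A = rng.normal(size=(size, d))
        yield A, A @ x_true + 0.1 * rng.normal(size=size)

x, seen = np.zeros(d), 0
for A, b in stream():
    seen += len(b)
    lr = np.sqrt(len(b)) / seen           # step size tied to the batch size
    x -= lr * A.T @ (A @ x - b) / len(b)  # averaged least-squares gradient step
print(np.linalg.norm(x - x_true))
```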
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and stochastic consensus optimization (SCO).
We prove linear convergence of both methods to a neighborhood of the solution when they use a constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
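
A rough sketch of stochastic extragradient on a noisy bilinear game, with iteration averaging and scheduled restarts from the averaged iterate; the game, noise level, step size, and restart schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.normal(size=(d, d))

def field(x, y, noise=0.1):
    """Noisy gradient field of the bilinear game min_x max_y x.T @ M @ y."""
    return M @ y + noise * rng.normal(size=d), M.T @ x + noise * rng.normal(size=d)

def seg_restart_avg(lr=0.05, restarts=5, iters=400):
    x, y = np.ones(d), np.ones(d)
    for _ in range(restarts):
        x_sum, y_sum = np.zeros(d), np.zeros(d)
        for _ in range(iters):
            gx, gy = field(x, y)
            x_half, y_half = x - lr * gx, y + lr * gy  # extrapolation step
            gx, gy = field(x_half, y_half)
            x, y = x - lr * gx, y + lr * gy            # update step
            x_sum += x
            y_sum += y
        x, y = x_sum / iters, y_sum / iters  # restart from the averaged iterate
    return x, y

x, y = seg_restart_avg()
print(np.linalg.norm(x), np.linalg.norm(y))  # the Nash equilibrium is (0, 0)
```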
- Incremental Without Replacement Sampling in Nonconvex Optimization [0.0]
Minibatch decomposition methods for empirical risk are commonly analyzed in a stochastic approximation setting, also known as sampling with replacement.
On the other hand, modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analyses are much scarcer.
We provide convergence guarantees for the latter variant by analyzing a versatile incremental gradient scheme.
arXiv Detail & Related papers (2020-07-15T09:17:29Z)
- Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization [15.82816385434718]
We present a unified theorem for the convergence analysis of gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer.
We do this by extending the unified analysis of Gorbunov, Hanzely & Richtárik (2020) and dropping the requirement that the loss function be strongly convex.
Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent type methods.
arXiv Detail & Related papers (2020-06-20T13:40:27Z)
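
As a concrete instance of the composite setting covered by such unified analyses, here is a minimal proximal SGD sketch for a smooth least-squares loss plus an l1 regularizer; the problem data, regularization weight, and step size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 8
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
lam = 0.1  # l1 regularization weight

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd(epochs=100, lr=0.01):
    """Minimize (1/2n) * ||A @ x - b||**2 + lam * ||x||_1 with proximal SGD."""
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (A[i] @ x - b[i]) * A[i]              # gradient of the smooth part
            x = soft_threshold(x - lr * g, lr * lam)  # gradient step, then prox
    return x

print(prox_sgd())
```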