Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned
Problems
- URL: http://arxiv.org/abs/2106.06880v1
- Date: Sat, 12 Jun 2021 23:07:27 GMT
- Title: Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned
Problems
- Authors: Itay Safran and Ohad Shamir
- Abstract summary: We show that without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds.
Since many problems in machine learning and other areas are both ill-conditioned and involve large datasets, this indicates that without-replacement does not necessarily improve over with-replacement sampling for realistic budgets.
- Score: 55.40911408462676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been much interest in studying the convergence rates of
without-replacement SGD, and proving that it is faster than with-replacement
SGD in the worst case. However, these works ignore or do not provide tight
bounds in terms of the problem's geometry, including its condition number.
Perhaps surprisingly, we prove that when the condition number is taken into
account, without-replacement SGD \emph{does not} significantly improve on
with-replacement SGD in terms of worst-case bounds, unless the number of epochs
(passes over the data) is larger than the condition number. Since many problems
in machine learning and other areas are both ill-conditioned and involve large
datasets, this indicates that without-replacement does not necessarily improve
over with-replacement sampling for realistic iteration budgets. We show this by
providing new lower and upper bounds which are tight (up to log factors), for
quadratic problems with commuting quadratic terms, precisely quantifying the
dependence on the problem parameters.
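To make the comparison concrete, below is a minimal numerical sketch (not the paper's construction or proof): it runs with-replacement SGD and random-reshuffling SGD on an average of quadratics with commuting (diagonal) terms, the setting in which the bounds are stated. The problem sizes, condition number, step size, and epoch counts are arbitrary illustrative choices.

```python
# Minimal sketch, not the paper's construction: compare with-replacement SGD and
# random-reshuffling (without-replacement) SGD on an average of quadratics
# f(x) = (1/n) * sum_i [0.5 * x^T A_i x - b_i^T x] with commuting (diagonal) A_i,
# so the condition number is roughly kappa. All parameter choices are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, kappa = 32, 8, 100.0                     # components, dimension, condition number
eigs = np.linspace(1.0 / kappa, 1.0, d)        # spectrum of the average quadratic term
A = np.stack([np.diag(eigs * rng.uniform(0.5, 1.5, size=d)) for _ in range(n)])
b = rng.standard_normal((n, d))

x_star = np.linalg.solve(A.mean(axis=0), b.mean(axis=0))   # minimizer of the average

def run(epochs, eta, without_replacement):
    x = np.zeros(d)
    for _ in range(epochs):
        idx = rng.permutation(n) if without_replacement else rng.integers(0, n, n)
        for i in idx:
            x -= eta * (A[i] @ x - b[i])       # stochastic gradient of component i
    return float(np.sum((x - x_star) ** 2))

for epochs in (10, 100, 1000):                 # below, around, and above kappa
    print(epochs,
          "with-replacement:", run(epochs, 0.2, False),
          "reshuffling:", run(epochs, 0.2, True))
```

The point of sweeping the epoch count past the condition number is only to mirror the regime distinction in the abstract; the sketch makes no claim about matching the paper's rates.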
Related papers
- Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems [10.65078014704416]
We study the naive variant, Incremental Gradient Descent (IGD), on smooth convex functions.
Our analyses reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when the component functions are strongly convex.
It is still unknown whether permutation-based SGD can converge faster in this small epoch regime.
arXiv Detail & Related papers (2025-06-04T16:17:25Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - SketchySGD: Reliable Stochastic Optimization via Randomized Curvature
Estimates [19.420605210427635]
SketchySGD improves upon existing gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian.
We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum.
In the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems.
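As a rough illustration of the SketchySGD entry above, here is a hedged sketch of preconditioning stochastic gradients with a randomized low-rank approximation of a subsampled Hessian on a least-squares problem. It is not the authors' algorithm or code; the simple range-finder, the rank, the damping parameter rho, the step size, and the batch sizes are illustrative assumptions.

```python
# Hedged sketch of the idea described above (not the authors' SketchySGD code):
# precondition stochastic gradients with a randomized low-rank approximation of a
# subsampled Hessian plus damping rho, on an ill-conditioned least-squares problem.
import numpy as np

rng = np.random.default_rng(1)
n, d, rank, rho, eta = 2000, 50, 10, 0.5, 0.2
X = rng.standard_normal((n, d)) * np.linspace(0.01, 1.0, d)    # ill-conditioned design
y = X @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

def lowrank_hessian(batch_idx, k):
    """Randomized range-finder for the subsampled Hessian H_S = X_S^T X_S / |S|."""
    Xs = X[batch_idx]
    H = Xs.T @ Xs / len(batch_idx)
    Q, _ = np.linalg.qr(H @ rng.standard_normal((d, k)))       # approximate top-k range
    lam, U = np.linalg.eigh(Q.T @ H @ Q)
    return Q @ U, np.maximum(lam, 0.0)                         # V (d x k), eigenvalues (k,)

w = np.zeros(d)
V, lam = lowrank_hessian(rng.choice(n, 256, replace=False), rank)
for _ in range(300):
    idx = rng.choice(n, 32, replace=False)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)            # minibatch gradient
    # Apply (V diag(lam) V^T + rho I)^{-1} to g, using that V has orthonormal columns.
    w -= eta * (V @ ((V.T @ g) * (1.0 / (lam + rho) - 1.0 / rho)) + g / rho)
print("final least-squares loss:", float(0.5 * np.mean((X @ w - y) ** 2)))
```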
arXiv Detail & Related papers (2022-11-16T01:05:41Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - The Benefits of Implicit Regularization from SGD in Least Squares
Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
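The snippet below is an illustrative sketch of that comparison (not the paper's experiment or bounds): it runs one pass of unregularized constant step-size SGD with iterate averaging on a least-squares problem and contrasts it with an explicit ridge solution. The ridge parameter chosen here is an ad hoc assumption, not a correspondence prescribed by the paper.

```python
# Illustrative sketch only: averaged, unregularized SGD vs. explicit ridge regression
# on a least-squares problem. The ridge parameter lam below is an ad hoc assumption.
import numpy as np

rng = np.random.default_rng(2)
n, d, eta = 5000, 20, 0.05
X = rng.standard_normal((n, d)) * np.linspace(0.1, 1.0, d)
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w, w_avg = np.zeros(d), np.zeros(d)
for t in range(n):                                  # single pass, one sample per step
    i = rng.integers(n)
    w -= eta * X[i] * (X[i] @ w - y[i])             # unregularized SGD step
    w_avg += (w - w_avg) / (t + 1)                  # Polyak-Ruppert iterate average

lam = 1.0 / (eta * n)                               # ad hoc ridge parameter
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

print("empirical risk, averaged SGD:", float(np.mean((X @ w_avg - y) ** 2)))
print("empirical risk, ridge       :", float(np.mean((X @ w_ridge - y) ** 2)))
```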
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
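For the SGDA entry above, the following is a hedged sketch of constant step-size stochastic gradient descent-ascent on a simple strongly-convex-strongly-concave quadratic game. It only illustrates last-iterate convergence to a neighborhood of the saddle point; the game, the noise model, and the step size are assumptions of this sketch, not the paper's expected co-coercivity setting.

```python
# Hedged sketch: constant step-size SGDA on f(x, y) = 0.5|x|^2 + x^T A y - 0.5|y|^2,
# whose saddle point is (0, 0). Problem, noise, and step size are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, n, eta = 5, 200, 0.05
A = np.eye(d)                                      # coupling matrix of the game
noise = rng.standard_normal((n, 2 * d)) * 0.5      # per-sample gradient noise
noise -= noise.mean(axis=0)                        # zero-mean over the n samples

x, y = np.ones(d), np.ones(d)
for _ in range(2000):
    i = rng.integers(n)
    gx = x + A @ y + noise[i, :d]                  # stochastic gradient of f in x
    gy = A.T @ x - y + noise[i, d:]                # stochastic gradient of f in y
    x -= eta * gx                                  # descent step in x
    y += eta * gy                                  # ascent step in y
print("distance to saddle point:", float(np.linalg.norm(np.concatenate([x, y]))))
```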
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - Can Single-Shuffle SGD be Better than Reshuffling SGD and GD? [77.82009268160053]
We conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities.
We present theorems that support our conjecture by proving several special cases.
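As a small numerical illustration of the quantity in that conjecture (not a proof or the paper's analysis), the sketch below compares the spectral norm of the expected product of factors (I - eta * A_i) under i.i.d. with-replacement sampling, where the expectation factorizes, against the average over all permutations of a single epoch. The components, dimensions, and step size are arbitrary.

```python
# Numerical sketch at tiny scale: spectral norms of the mean matrix product of
# (I - eta * A_i) under with-replacement sampling vs. averaging over all
# single-epoch permutations. A_i, n, d, and eta are arbitrary illustrative choices.
import itertools
import math
import numpy as np

rng = np.random.default_rng(4)
n, d, eta = 4, 3, 0.1
A = np.stack([S @ S.T for S in rng.standard_normal((n, d, d))])   # symmetric PSD terms
I = np.eye(d)

# With replacement: the factors are i.i.d., so the expected product factorizes.
mean_with = np.linalg.matrix_power(I - eta * A.mean(axis=0), n)

# Without replacement: average the ordered product over all n! permutations.
mean_without = sum(
    np.linalg.multi_dot([I - eta * A[i] for i in perm])
    for perm in itertools.permutations(range(n))
) / math.factorial(n)

print("spectral norm, with replacement   :", np.linalg.norm(mean_with, 2))
print("spectral norm, without replacement:", np.linalg.norm(mean_without, 2))
```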
arXiv Detail & Related papers (2021-03-12T04:34:45Z) - IntSGD: Floatless Compression of Stochastic Gradients [19.99615698375829]
We propose a family of lossy integer compressions for Stochastic Gradient Descent (SGD) that do not communicate a single float.
We show that when the data is significantly heterogeneous, it may become increasingly hard to keep the integers bounded, and we propose an alternative algorithm, IntDIANA, to solve this type of problem.
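To illustrate the compression idea in this entry (a sketch of the general idea only, not the paper's exact scaling rule or its IntDIANA variant), the snippet below scales a gradient by a value assumed to be known to all workers, rounds it to integers with unbiased randomized rounding, and rescales after communication.

```python
# Sketch of integer compression of a gradient: scale by a shared value alpha,
# round randomly (unbiased) to integers, communicate only integers, then rescale.
# The fixed alpha here is an assumption; the paper adapts the scale over time.
import numpy as np

rng = np.random.default_rng(5)

def int_compress(g, alpha):
    """Randomly round alpha * g to nearby integers so the result is unbiased."""
    scaled = alpha * g
    low = np.floor(scaled)
    return (low + (rng.random(g.shape) < scaled - low)).astype(np.int64)

def int_decompress(q, alpha):
    return q.astype(np.float64) / alpha

g = rng.standard_normal(10)
alpha = 1024.0                      # shared scale known to every worker (assumed)
q = int_compress(g, alpha)          # the integers that would be communicated
g_hat = int_decompress(q, alpha)
print("max abs error:", np.max(np.abs(g - g_hat)))   # at most 1 / alpha
```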
arXiv Detail & Related papers (2021-02-16T18:58:57Z) - On Tight Convergence Rates of Without-replacement SGD [52.99237600875452]
For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD with replacement sampling.
In this work, we overcome these limitations by analyzing step sizes that vary across epochs.
arXiv Detail & Related papers (2020-04-18T16:14:11Z)