On Tight Convergence Rates of Without-replacement SGD
- URL: http://arxiv.org/abs/2004.08657v1
- Date: Sat, 18 Apr 2020 16:14:11 GMT
- Title: On Tight Convergence Rates of Without-replacement SGD
- Authors: Kwangjun Ahn and Suvrit Sra
- Abstract summary: For solving finite-sum optimization problems, SGD without replacement sampling is empirically shown to outperform SGD.
In this work, we overcome these limitations by analyzing step sizes that vary across epochs.
- Score: 52.99237600875452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For solving finite-sum optimization problems, SGD without replacement
sampling is empirically shown to outperform SGD. Denoting by $n$ the number of
components in the cost and $K$ the number of epochs of the algorithm, several
recent works have shown convergence rates of without-replacement SGD that have
better dependency on $n$ and $K$ than the baseline rate of $O(1/(nK))$ for SGD.
However, there are two main limitations shared among those works: the rates
have extra poly-logarithmic factors on $nK$, and denoting by $\kappa$ the
condition number of the problem, the rates hold after $\kappa^c\log(nK)$ epochs
for some $c>0$. In this work, we overcome these limitations by analyzing step
sizes that vary across epochs.
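The setting above can be illustrated with a minimal without-replacement (random-reshuffling) SGD loop whose step size decays across epochs. This is a hedged sketch, not the paper's algorithm: the quadratic components and the $1/(2k)$ schedule are assumptions chosen only to make the example runnable.

```python
import random

def shuffled_sgd(grads, x0, n_epochs, step_size):
    """Without-replacement SGD: each epoch visits every component
    gradient exactly once, in a freshly shuffled order."""
    x = x0
    n = len(grads)
    for epoch in range(1, n_epochs + 1):
        order = random.sample(range(n), n)  # sample WITHOUT replacement
        eta = step_size(epoch)              # step size varies across epochs
        for i in order:
            x = x - eta * grads[i](x)
    return x

# Toy finite sum: f(x) = (1/n) * sum_i (x - a_i)^2 / 2, minimized at mean(a) = 2.5.
random.seed(0)
a = [1.0, 2.0, 3.0, 4.0]
grads = [(lambda x, ai=ai: x - ai) for ai in a]  # grad of (x - a_i)^2 / 2
x_hat = shuffled_sgd(grads, x0=0.0, n_epochs=50,
                     step_size=lambda k: 1.0 / (2.0 * k))
```

After 50 epochs the iterate lands close to the minimizer 2.5; because the per-epoch step size shrinks, the bias introduced by the within-epoch visiting order also shrinks.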
Related papers
- Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems [10.65078014704416]
We study the naive variant, Incremental Gradient Descent (IGD), on smooth convex functions.
Our analyses reveal that in the small epoch regime, IGD can converge surprisingly slowly when the functions are strongly convex.
It is still unknown whether permutation-based SGD can converge faster in this small epoch regime.
arXiv Detail & Related papers (2025-06-04T16:17:25Z)
- Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization [34.451177321785146]
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model.
We show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting.
arXiv Detail & Related papers (2025-05-13T07:32:48Z)
- Convergence Rate Analysis of LION [54.28350823319057]
We show that LION converges to a Karush-Kuhn-Tucker (KKT) point at a rate of $\mathcal{O}(\sqrt{d}K^{-1/4})$ measured by the norm of the gradient.
We show that LION can achieve lower loss and higher performance compared to standard SGD.
arXiv Detail & Related papers (2024-11-12T11:30:53Z)
- Demystifying the Myths and Legends of Nonconvex Convergence of SGD [17.445810977264067]
Stochastic gradient descent (SGD) and its variants are the main workhorses for solving large-scale optimization problems.
Through our analyses, we address certain myths and legends related to the nonconvex convergence of SGD.
arXiv Detail & Related papers (2023-10-19T17:58:59Z)
- Convergence Analysis of Decentralized ASGD [1.8710230264817358]
We present a novel convergence-rate analysis for decentralized asynchronous SGD (DASGD) that requires neither partial synchronization among nodes nor restrictive network topologies.
Our convergence proof holds for a fixed stepsize and any nonconvex, homogeneous, $L$-smooth objective function.
arXiv Detail & Related papers (2023-09-07T14:50:31Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation in which workers communicate with varying frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1})$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Faster Convergence of Local SGD for Over-Parameterized Models [1.5504102675587357]
Modern machine learning architectures are often highly expressive.
We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized functions in the heterogeneous data setting.
For general convex loss functions, we establish an error bound of $O(1/T)$ under a mild data similarity assumption, and an error bound of $O(K/T)$ otherwise, where $K$ is the number of local steps and $T$ the total number of iterations.
For nonconvex loss functions, we prove an error bound of $O(K/T)$ in both cases.
We complete our results by providing problem instances in which our established convergence rates are tight up to a constant factor with a reasonably small stepsize.
arXiv Detail & Related papers (2022-01-30T04:05:56Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise in the stochastic gradients and (ii) problem-dependent constants.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain SCO problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z) - Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned
Problems [55.40911408462676]
We show that without-replacement SGD does not significantly improve on with-replacement SGD in terms of worst-case bounds.
Since many problems in machine learning and other areas are both ill-conditioned and involve large datasets, this indicates that without-replacement does not necessarily improve over with-replacement sampling for realistic budgets.
arXiv Detail & Related papers (2021-06-12T23:07:27Z) - Fast Rates for the Regret of Offline Reinforcement Learning [69.23654172273085]
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy in an infinite-horizon discounted Markov decision process (MDP).
We show that given any estimate for the optimal quality function $Q*$, the regret of the policy it defines converges at a rate given by the exponentiation of the $Q*$-estimate's pointwise convergence rate.
arXiv Detail & Related papers (2021-01-31T16:17:56Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ under suitable step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.