Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds
- URL: http://arxiv.org/abs/2306.12498v2
- Date: Wed, 7 Feb 2024 18:06:54 GMT
- Title: Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds
- Authors: Xufeng Cai, Cheuk Yin Lin, Jelena Diakonikolas
- Abstract summary: gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
- Score: 12.699376765058137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient descent (SGD) is perhaps the most prevalent optimization
method in modern machine learning. Contrary to the empirical practice of
sampling from the datasets without replacement and with (possible) reshuffling
at each epoch, the theoretical counterpart of SGD usually relies on the
assumption of sampling with replacement. It is only very recently that SGD with
sampling without replacement -- shuffled SGD -- has been analyzed. For convex
finite sum problems with $n$ components and under the $L$-smoothness assumption
for each component function, there are matching upper and lower bounds, under
sufficiently small -- $\mathcal{O}(\frac{1}{nL})$ -- step sizes. Yet those
bounds appear too pessimistic -- in fact, the predicted performance is
generally no better than for full gradient descent -- and do not agree with the
empirical observations. In this work, to narrow the gap between the theory and
practice of shuffled SGD, we sharpen the focus from general finite sum problems
to empirical risk minimization with linear predictors. This allows us to take a
primal-dual perspective and interpret shuffled SGD as a primal-dual method with
cyclic coordinate updates on the dual side. Leveraging this perspective, we
prove fine-grained complexity bounds that depend on the data matrix and are
never worse than what is predicted by the existing bounds. Notably, our bounds
predict much faster convergence than the existing analyses -- by a factor of
the order of $\sqrt{n}$ in some cases. We empirically demonstrate that on
common machine learning datasets our bounds are indeed much tighter. We further
extend our analysis to nonsmooth convex problems and more general finite-sum
problems, with similar improvements.
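As a concrete illustration of the setting, here is a minimal runnable sketch of shuffled SGD for ERM with a linear predictor and logistic loss. It is illustrative only: it implements the plain primal method, not the paper's primal-dual analysis; the helper names (`shuffled_sgd`, `logistic_grad`) and the synthetic data are ours; and the default step size follows the small $\mathcal{O}(\frac{1}{nL})$ regime discussed above.

```python
import numpy as np

def logistic_grad(margin):
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), computed stably.
    return -np.exp(-np.logaddexp(0.0, margin))

def shuffled_sgd(A, y, n_epochs=10, step=None, seed=0):
    """Shuffled SGD: each epoch samples the n components without
    replacement by iterating over a fresh random permutation."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    # Each component f_i(x) = log(1 + exp(-y_i <a_i, x>)) is L_i-smooth
    # with L_i <= ||a_i||^2 / 4; take the worst case as L.
    L = 0.25 * np.max(np.sum(A * A, axis=1))
    if step is None:
        step = 1.0 / (n * L)  # the small-step regime of the existing bounds
    x = np.zeros(d)
    for _ in range(n_epochs):
        for i in rng.permutation(n):  # reshuffle at every epoch
            x -= step * logistic_grad(y[i] * (A[i] @ x)) * y[i] * A[i]
    return x

# Example usage on synthetic data:
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
y = np.sign(A @ rng.standard_normal(5))
x_hat = shuffled_sgd(A, y, n_epochs=50)
```

Replacing the permutation loop with `rng.integers(n, size=n)` recovers the with-replacement sampling that most classical analyses assume.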
Related papers
- Error Feedback under $(L_0,L_1)$-Smoothness: Normalization and Momentum [56.37522020675243]
We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems.
We show that due to their larger allowable stepsizes, our new normalized error feedback algorithms outperform their non-normalized counterparts on various tasks.
arXiv Detail & Related papers (2024-10-22T10:19:27Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider stochastic optimization problems with heavy-tailed noise with structured density, tackled via gradient clipping (a minimal clipped-step sketch appears after this list).
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha - 1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex
Optimization [9.019243171993553]
Training steps $T$ and step-size $\eta$ affect generalization in smooth stochastic convex optimization (SCO) problems.
We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD).
Recent works show better rates can be attained but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z) - Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling [64.31011847952006]
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the PAGE algorithm so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of optimal minibatch sizes in the smooth nonconvex regime.
arXiv Detail & Related papers (2022-06-05T21:32:33Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur, and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - Stochastic regularized majorization-minimization with weakly convex and
multi-convex surrogates [0.0]
We show that the first-order optimality gap of the proposed algorithm decays at the rate of the expected loss under a possibly dependent data setting.
We obtain first-order convergence guarantees for various existing methods under this dependent data setting.
arXiv Detail & Related papers (2022-01-05T15:17:35Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Lojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We provide a general sample complexity analysis of "vanilla" PG, which estimates gradients from sampled trajectories.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Convergence Rates of Stochastic Gradient Descent under Infinite Noise
Variance [14.06947898164194]
Heavy tails emerge in stochastic gradient descent (SGD) in various scenarios.
We provide convergence guarantees for SGD under a state-dependent and heavy-tailed noise with a potentially infinite variance.
Our results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum.
arXiv Detail & Related papers (2021-02-20T13:45:11Z) - Last iterate convergence of SGD for Least-Squares in the Interpolation
regime [19.05750582096579]
We study the noiseless model in the fundamental least-squares setup.
We assume that an optimum predictor fits inputs and outputs perfectly, $\langle \theta_*, \phi(X) \rangle = Y$, where $\phi(X)$ stands for a possibly infinite-dimensional non-linear feature map.
arXiv Detail & Related papers (2021-02-05T14:02:20Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale nonconvex optimization problems are ubiquitous in modern machine learning.
We perform experiments on synthetic problems to study the effect of a wide array of minibatch sizes on Stochastic Gradient Descent (SGD).
arXiv Detail & Related papers (2020-02-09T09:56:06Z)
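As referenced in the heavy-tailed noise entry above, the following is a minimal sketch of the gradient-clipping update on which such methods build; it is illustrative only, and the cited paper's algorithms and rates are considerably more refined.

```python
import numpy as np

def clipped_sgd_step(x, grad, step, clip_level):
    """One SGD step with the stochastic gradient clipped to
    norm <= clip_level, which tames heavy-tailed noise whose
    variance may be unbounded."""
    norm = np.linalg.norm(grad)
    if norm > clip_level:
        grad = grad * (clip_level / norm)  # rescale onto the clipping ball
    return x - step * grad
```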
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.