Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
- URL: http://arxiv.org/abs/2103.07079v1
- Date: Fri, 12 Mar 2021 04:34:45 GMT
- Title: Can Single-Shuffle SGD be Better than Reshuffling SGD and GD?
- Authors: Chulhee Yun, Suvrit Sra, Ali Jadbabaie
- Abstract summary: We conjecture that the means of matrix products corresponding to with- and without-replacement variants of SGD satisfy a series of spectral norm inequalities.
We present theorems that support our conjecture by proving several special cases.
- Score: 77.82009268160053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose matrix norm inequalities that extend the Recht-R\'e (2012)
conjecture on a noncommutative AM-GM inequality by supplementing it with
another inequality that accounts for single-shuffle, which is a widely used
without-replacement sampling scheme that shuffles only once in the beginning
and is overlooked in the Recht-R\'e conjecture. Instead of general positive
semidefinite matrices, we restrict our attention to positive definite matrices
with small enough condition numbers, which are more relevant to matrices that
arise in the analysis of SGD. For such matrices, we conjecture that the means
of matrix products corresponding to with- and without-replacement variants of
SGD satisfy a series of spectral norm inequalities that can be summarized as:
"single-shuffle SGD converges faster than random-reshuffle SGD, which is in
turn faster than with-replacement SGD." We present theorems that support our
conjecture by proving several special cases.
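The conjectured ordering is easy to probe numerically. The sketch below is a minimal illustration, not the paper's exact statement: it assumes an illustrative construction of positive definite matrices with small condition number, a small step size, and a few epochs, and compares the spectral norms of the mean matrix products produced by single-shuffle, random-reshuffle, and with-replacement sampling.
```python
# Minimal numerical sketch (not the paper's exact statement): compare the
# spectral norms of the mean matrix products arising from single-shuffle,
# random-reshuffle, and with-replacement sampling.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 4, 3          # dimension, number of component matrices, epochs (illustrative)
alpha = 0.1                # small step size so each factor I - alpha*A_i is a contraction

# Positive definite A_i with small condition number (illustrative construction).
A = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    S = (B + B.T) / 2
    A.append(np.eye(d) + 0.2 * S / np.linalg.norm(S, 2))   # eigenvalues in [0.8, 1.2]
Y = [np.eye(d) - alpha * Ai for Ai in A]                   # per-step contraction factors

def epoch_product(sigma):
    P = np.eye(d)
    for i in sigma:
        P = Y[i] @ P        # apply Y_{sigma(1)}, ..., Y_{sigma(n)} in order
    return P

perms = list(itertools.permutations(range(n)))
products = [epoch_product(sigma) for sigma in perms]

# Single shuffle: one permutation drawn once and reused for all k epochs.
ss = sum(np.linalg.matrix_power(P, k) for P in products) / len(perms)
# Random reshuffling: a fresh permutation each epoch, so the mean factorizes.
rr = np.linalg.matrix_power(sum(products) / len(perms), k)
# With replacement: n*k i.i.d. uniform draws of a component matrix.
iid = np.linalg.matrix_power(sum(Y) / n, n * k)

for name, M in [("single-shuffle", ss), ("random-reshuffle", rr), ("with-replacement", iid)]:
    print(f"{name:16s}: {np.linalg.norm(M, 2):.6f}")
```
A smaller spectral norm of the mean product corresponds to a faster worst-case contraction of the error, matching the informal reading "single-shuffle is faster than random-reshuffle, which is faster than with-replacement."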
Related papers
- Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities [40.1331539540764]
We provide a convergence analysis of SEG with Random Reshuffling (SEG-RR) for three classes of VIPs.
We derive conditions under which SEG-RR achieves a faster convergence rate than the uniform with-replacement sampling SEG.
In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes.
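For intuition, a minimal sketch of one random-reshuffling pass of stochastic extragradient is shown below; the same-component extrapolation/update rule, the step sizes, and the toy affine operators are illustrative assumptions rather than the exact SEG-RR variant analyzed in the paper.
```python
import numpy as np

def seg_rr_epoch(z, F, gamma1, gamma2, rng):
    """One random-reshuffling pass of stochastic extragradient (illustrative sketch)."""
    order = rng.permutation(len(F))       # shuffle the component operators once per epoch
    for i in order:
        z_half = z - gamma1 * F[i](z)     # extrapolation step with component i
        z = z - gamma2 * F[i](z_half)     # update step with the same component
    return z

# Toy usage: a strongly monotone affine operator split into two components,
# whose average vanishes at z = -0.6.
rng = np.random.default_rng(1)
F = [lambda z: 0.5 * z + 1.0, lambda z: 0.5 * z - 0.4]
z = np.array([5.0])
for _ in range(200):
    z = seg_rr_epoch(z, F, gamma1=0.1, gamma2=0.1, rng=rng)
print(z)    # hovers near -0.6, up to a small last-iterate shuffling bias
```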
arXiv Detail & Related papers (2024-03-11T20:35:52Z) - Approximate orthogonality of permutation operators, with application to quantum information [0.9790236766474201]
Consider the $n!$ different unitary matrices that permute $n$ $d$-dimensional quantum systems.
If $d \geq n$, then they are linearly independent.
This simple point has several applications in quantum information and random matrix theory.
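The linear-independence claim can be checked numerically for small n and d, for example via the rank of the matrix whose rows are the flattened permutation operators; the sizes below are illustrative.
```python
import itertools
import numpy as np

def permutation_operator(sigma, d):
    """Unitary on (C^d)^{⊗n} that permutes the n tensor factors according to sigma."""
    n = len(sigma)
    shape = (d,) * n
    P = np.zeros((d ** n, d ** n))
    for idx in itertools.product(range(d), repeat=n):
        src = np.ravel_multi_index(idx, shape)
        dst = np.ravel_multi_index(tuple(idx[sigma.index(j)] for j in range(n)), shape)
        P[dst, src] = 1.0
    return P

for n, d in [(3, 3), (3, 2)]:   # d >= n versus d < n
    ops = [permutation_operator(sigma, d) for sigma in itertools.permutations(range(n))]
    rank = np.linalg.matrix_rank(np.array([op.flatten() for op in ops]))
    print(f"n={n}, d={d}: rank {rank} of {len(ops)} operators")
# Expected: full rank 6 when d >= n; a rank drop once d < n.
```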
arXiv Detail & Related papers (2023-09-01T19:45:30Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective and Improved Bounds [12.699376765058137]
Stochastic gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
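As a point of reference for what shuffled SGD means operationally (the paper's primal-dual analysis is not reproduced here), a minimal epoch over a least-squares empirical risk might look as follows; the loss, step size, and problem sizes are illustrative assumptions.
```python
import numpy as np

def shuffled_sgd_epoch(w, X, y, lr, rng):
    """One without-replacement pass over the samples of the least-squares risk."""
    for i in rng.permutation(X.shape[0]):      # visit every sample exactly once, in shuffled order
        grad_i = (X[i] @ w - y[i]) * X[i]      # per-sample gradient of 0.5*(x_i^T w - y_i)^2
        w = w - lr * grad_i
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
w_star = rng.standard_normal(5)
y = X @ w_star                                  # noiseless targets, so w_star is recoverable
w = np.zeros(5)
for _ in range(100):                            # multiple epochs, reshuffled each time
    w = shuffled_sgd_epoch(w, X, y, lr=0.05, rng=rng)
print(np.linalg.norm(w - w_star))               # should be close to zero
```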
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - Rank-1 Matrix Completion with Gradient Descent and Small Random Initialization [15.127728811011245]
We show that the implicit regularization of GD plays a critical role in the analysis.
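A minimal sketch of the setting, gradient descent on a symmetric rank-1 factorization from a small random initialization, is given below; the observation rate, step size, and initialization scale are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
d = 30
u_star = rng.standard_normal(d)
M = np.outer(u_star, u_star)                  # rank-1 ground truth M = u* u*^T
mask = rng.random((d, d)) < 0.3               # observe roughly 30% of the entries

# Factorized objective: f(x) = 0.5 * || mask * (x x^T - M) ||_F^2
x = 1e-3 * rng.standard_normal(d)             # small random initialization
lr = 0.01
for _ in range(2000):
    R = mask * (np.outer(x, x) - M)           # residual restricted to observed entries
    x = x - lr * (R + R.T) @ x                # gradient of the factorized objective
print(np.linalg.norm(np.outer(x, x) - M) / np.linalg.norm(M))   # relative recovery error
```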
arXiv Detail & Related papers (2022-12-19T12:05:37Z) - Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it saves on the number of stochastic gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z) - Benign Underfitting of Stochastic Gradient Descent [72.38051710389732]
We study to what extent stochastic gradient descent (SGD) may be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to the training data.
We analyze the closely related with-replacement SGD, for which an analogous phenomenon does not occur and prove that its population risk does in fact converge at the optimal rate.
arXiv Detail & Related papers (2022-02-27T13:25:01Z) - The Benefits of Implicit Regularization from SGD in Least Squares Problems [116.85246178212616]
Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice.
We compare the implicit regularization afforded by (unregularized) averaged SGD with the explicit regularization of ridge regression.
arXiv Detail & Related papers (2021-08-10T09:56:47Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central in preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that of ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Optimal Iterative Sketching with the Subsampled Randomized Hadamard Transform [64.90148466525754]
We study the performance of iterative sketching for least-squares problems.
We show that the convergence rates for Haar and randomized Hadamard matrices are identical, and asymptotically improve upon random projections.
These techniques may be applied to other algorithms that employ randomized dimension reduction.
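A rough sketch of one such iterative sketching scheme (an iterative Hessian-sketch-style update with a subsampled randomized Hadamard transform) is given below; the problem sizes, sketch dimension, and the choice to refresh the sketch at every iteration are illustrative assumptions.
```python
import numpy as np
from scipy.linalg import hadamard

def srht_sketch(A, m, rng):
    """Subsampled randomized Hadamard transform of the rows of A (n must be a power of 2)."""
    n = A.shape[0]
    D = rng.choice([-1.0, 1.0], size=n)                 # random sign flips
    HA = hadamard(n) @ (D[:, None] * A) / np.sqrt(n)    # orthonormal Hadamard mixing
    rows = rng.choice(n, size=m, replace=False)         # uniform row subsampling
    return np.sqrt(n / m) * HA[rows]

# Iterative Hessian-sketch-style solver for min_x ||A x - b||^2 (illustrative sizes).
rng = np.random.default_rng(0)
n, d, m = 1024, 20, 128
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]             # exact least-squares solution for reference

x = np.zeros(d)
for _ in range(10):
    SA = srht_sketch(A, m, rng)                         # fresh sketch each iteration
    g = A.T @ (A @ x - b)                               # exact gradient of the objective
    x = x - np.linalg.solve(SA.T @ SA, g)               # step preconditioned by the sketched Hessian
    print(np.linalg.norm(x - x_ls))                     # error shrinks geometrically
```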
arXiv Detail & Related papers (2020-02-03T16:17:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.