Characterizing & Finding Good Data Orderings for Fast Convergence of
Sequential Gradient Methods
- URL: http://arxiv.org/abs/2202.01838v1
- Date: Thu, 3 Feb 2022 20:38:42 GMT
- Title: Characterizing & Finding Good Data Orderings for Fast Convergence of
Sequential Gradient Methods
- Authors: Amirkeivan Mohtashami, Sebastian Stich, Martin Jaggi
- Abstract summary: We quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations.
We develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While SGD, which samples from the data with replacement, is widely studied in
theory, a variant called Random Reshuffling (RR) is more common in practice. RR
iterates through random permutations of the dataset and has been shown to
converge faster than SGD. When the order is chosen deterministically, a variant
called incremental gradient descent (IG), the existing convergence bounds show
improvement over SGD but are worse than RR. However, these bounds do not
differentiate between a good and a bad ordering and hold for the worst choice
of order. Meanwhile, in some cases, choosing the right order when using IG can
lead to convergence faster than RR. In this work, we quantify the effect of
order on convergence speed, obtaining convergence bounds based on the chosen
sequence of permutations while also recovering previous results for RR. In
addition, we show the benefits of using structured shuffling when various levels of
abstraction (e.g. tasks, classes, augmentations) exist in the dataset,
in theory and in practice. Finally, relying on our measure, we develop a greedy
algorithm for choosing good orders during training, achieving superior
performance (by more than 14 percent in accuracy) over RR.
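To make the distinction concrete, here is a minimal sketch (illustrative code, not the authors' implementation) of the three data-ordering schemes the abstract contrasts: with-replacement SGD sampling, Random Reshuffling, and a structured shuffle that keeps dataset groups such as classes or tasks contiguous. The function names and grouping are assumptions for illustration only.

```python
import random

def sgd_order(n, epochs, rng):
    # With-replacement sampling: every step draws an index uniformly at random,
    # so the same example may repeat within an epoch.
    return [[rng.randrange(n) for _ in range(n)] for _ in range(epochs)]

def rr_order(n, epochs, rng):
    # Random Reshuffling (RR): a fresh permutation of the whole dataset
    # at the start of each epoch; every example is visited exactly once.
    orders = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        orders.append(perm)
    return orders

def structured_order(groups, epochs, rng):
    # Structured shuffling: shuffle examples within each group (e.g. a class,
    # task, or augmentation level), then shuffle the order of the groups,
    # keeping each group's examples contiguous in the resulting order.
    orders = []
    for _ in range(epochs):
        blocks = [list(g) for g in groups]
        for block in blocks:
            rng.shuffle(block)
        rng.shuffle(blocks)
        orders.append([i for block in blocks for i in block])
    return orders
```

For example, `structured_order([[0, 1, 2], [3, 4, 5]], 1, rng)` always emits the three examples of each group as one contiguous block, whereas `rr_order` interleaves them freely.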
Related papers
- Ringleader ASGD: The First Asynchronous SGD with Optimal Time Complexity under Data Heterogeneity [51.56484100374058]
We introduce Ringleader ASGD, the first asynchronous algorithm that attains the theoretical lower bounds for parallel computation.
Our analysis further establishes that Ringleader ASGD remains optimal under arbitrary data heterogeneity and even time-varying computation speeds.
arXiv Detail & Related papers (2025-09-26T19:19:15Z) - Efficient Differentiable Approximation of Generalized Low-rank Regularization [64.73416824444328]
Low-rank regularization (LRR) has been widely applied in various machine learning tasks.
In this paper, we propose an efficient differentiable approximation of LRR.
arXiv Detail & Related papers (2025-05-21T11:49:17Z) - MindFlayer: Efficient Asynchronous Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times [49.1574468325115]
We study the problem of minimizing the expectation of smooth nonconvex functions with the help of several parallel workers.
We propose a new asynchronous SGD method, MindFlayer SGD, designed for settings where worker compute times are random and possibly heavy-tailed.
Our theoretical and empirical results demonstrate the superiority of MindFlayer SGD when the compute-time noise is heavy-tailed.
arXiv Detail & Related papers (2024-10-05T21:11:32Z) - Stochastic Extragradient with Random Reshuffling: Improved Convergence for Variational Inequalities [40.1331539540764]
We provide a convergence analysis of SEG with Random Reshuffling (SEG-RR) for three classes of VIPs.
We derive conditions under which SEG-RR achieves a faster convergence rate than SEG with uniform with-replacement sampling.
In the monotone setting, our analysis of SEG-RR guarantees convergence to an arbitrary accuracy without large batch sizes.
arXiv Detail & Related papers (2024-03-11T20:35:52Z) - Last-Iterate Convergence of Adaptive Riemannian Gradient Descent for Equilibrium Computation [52.73824786627612]
This paper establishes new convergence results for geodesically strongly monotone games.
Our key result shows that RGD attains last-iterate linear convergence in a geometry-agnostic fashion.
Overall, this paper presents the first geometry-agnostic last-iterate convergence analysis for games beyond Euclidean settings.
arXiv Detail & Related papers (2023-06-29T01:20:44Z) - Fast Convergence of Random Reshuffling under Over-Parameterization and
the Polyak-Łojasiewicz Condition [41.58274719943315]
We study the convergence properties of a sampling-without-replacement variant of stochastic gradient descent (SGD) known as random reshuffling (RR).
RR chooses a random permutation of the data at the beginning of each epoch, and each step takes the next sample from the permutation.
arXiv Detail & Related papers (2023-04-02T06:00:29Z) - Coordinating Distributed Example Orders for Provably Accelerated
Training [39.05759866984658]
We propose Coordinated Distributed GraB (CD-GraB) to translate the benefits of permutation-based example ordering to distributed settings.
With negligible overhead, CD-GraB exhibits a linear speedup in convergence rate over centralized GraB and outperforms distributed RR on a variety of benchmark tasks.
arXiv Detail & Related papers (2023-02-02T03:15:29Z) - Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and
Beyond [63.59034509960994]
We study shuffling-based variants: minibatch and local Random Reshuffling, which draw gradients without replacement.
For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds which show that these shuffling-based variants converge faster than their with-replacement counterparts.
We propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
arXiv Detail & Related papers (2021-10-20T02:25:25Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a globally optimal model under mild delay assumptions.
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Benign Overfitting of Constant-Stepsize SGD for Linear Regression [122.70478935214128]
Inductive biases are central to preventing overfitting in practice.
This work considers this issue in arguably the most basic setting: constant-stepsize SGD for linear regression.
We reflect on a number of notable differences between the algorithmic regularization afforded by (unregularized) SGD and that afforded by ordinary least squares.
arXiv Detail & Related papers (2021-03-23T17:15:53Z) - Stochastic Reweighted Gradient Descent [4.355567556995855]
We propose an importance-sampling-based algorithm we call SRG (stochastic reweighted gradient).
We pay particular attention to the time and memory overhead of our proposed method.
We present empirical results to support our findings.
arXiv Detail & Related papers (2021-03-23T04:09:43Z) - ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm [71.13558000599839]
We study the problem of solving strongly convex and smooth unconstrained optimization problems using first-order algorithms.
We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable recursive averaging of past stochastic gradients.
We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense.
arXiv Detail & Related papers (2020-08-28T14:46:56Z) - When Does Preconditioning Help or Hurt Generalization? [74.25170084614098]
We show how the implicit bias of first- and second-order methods affects the comparison of their generalization properties.
We discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD.
arXiv Detail & Related papers (2020-06-18T17:57:26Z) - Random Reshuffling: Simple Analysis with Vast Improvements [9.169947558498535]
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling.
arXiv Detail & Related papers (2020-06-10T17:57:21Z) - Top-k Training of GANs: Improving GAN Performance by Throwing Away Bad
Samples [67.11669996924671]
We introduce a simple (one line of code) modification to the Generative Adversarial Network (GAN) training algorithm.
When updating the generator parameters, we zero out the gradient contributions from the elements of the batch that the critic scores as 'least realistic'.
We show that this 'top-k update' procedure is a generally applicable improvement.
arXiv Detail & Related papers (2020-02-14T19:27:50Z)
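The "one line of code" modification described in the last entry above can be sketched as follows. This is an illustrative pure-Python version, not the authors' code: the function names and batch values are hypothetical, and real GAN training would apply such a mask to per-sample generator losses inside the training framework of choice.

```python
def topk_mask(critic_scores, k):
    # Indices of the k most-realistic samples (highest critic scores).
    idx = sorted(range(len(critic_scores)), key=lambda i: critic_scores[i])[-k:]
    keep = set(idx)
    # Masked-out samples contribute zero gradient to the generator update.
    return [1.0 if i in keep else 0.0 for i in range(len(critic_scores))]

def topk_generator_loss(per_sample_loss, critic_scores, k):
    # Average the generator loss over only the top-k scored samples,
    # so the worst samples are effectively thrown away this step.
    mask = topk_mask(critic_scores, k)
    return sum(m * l for m, l in zip(mask, per_sample_loss)) / sum(mask)
```

For instance, with critic scores `[0.1, 0.9, 0.5, 0.2]` and `k=2`, only the second and third samples are kept when averaging the generator loss.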
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.