Random Reshuffling: Simple Analysis with Vast Improvements
- URL: http://arxiv.org/abs/2006.05988v3
- Date: Mon, 5 Apr 2021 11:40:26 GMT
- Title: Random Reshuffling: Simple Analysis with Vast Improvements
- Authors: Konstantin Mishchenko and Ahmed Khaled and Peter Richtárik
- Abstract summary: Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative gradient descent steps in conjunction with data reshuffling.
- Score: 9.169947558498535
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions
that utilizes iterative gradient descent steps in conjunction with data
reshuffling. Often contrasted with its sibling Stochastic Gradient Descent
(SGD), RR is usually faster in practice and enjoys significant popularity in
convex and non-convex optimization. The convergence rate of RR has attracted
substantial attention recently and, for strongly convex and smooth functions,
it was shown to converge faster than SGD if 1) the stepsize is small, 2) the
gradients are bounded, and 3) the number of epochs is large. We remove these 3
assumptions, improve the dependence on the condition number from $\kappa^2$ to
$\kappa$ (resp. from $\kappa$ to $\sqrt{\kappa}$) and, in addition, show that
RR has a different type of variance. We argue through theory and experiments
that the new variance type gives an additional justification of the superior
performance of RR. To go beyond strong convexity, we present several results
for non-strongly convex and non-convex objectives. We show that in all cases,
our theory improves upon existing literature. Finally, we prove fast
convergence of the Shuffle-Once (SO) algorithm, which shuffles the data only
once, at the beginning of the optimization process. Our theory for
strongly-convex objectives tightly matches the known lower bounds for both RR
and SO and substantiates the common practical heuristic of shuffling once or
only a few times. As a byproduct of our analysis, we also get new results for
the Incremental Gradient algorithm (IG), which does not shuffle the data at
all.
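The abstract describes closely related methods that differ only in how the pass order over the data is chosen: RR reshuffles every epoch, SO shuffles once up front, IG never shuffles, and SGD samples with replacement. A minimal NumPy sketch of these sampling schemes (the function names and toy problem are illustrative, not taken from the paper's code):
```python
import numpy as np

def run_epochs(grads, x0, lr, n_epochs, scheme, rng):
    """Epoch-based incremental gradient steps on a finite sum.

    grads  : list of per-example gradient functions g_i(x)
    scheme : 'RR'  - reshuffle at the start of every epoch
             'SO'  - shuffle once before the first epoch, reuse that order
             'IG'  - fixed order 0, 1, ..., n-1, never shuffled
             'SGD' - sample with replacement at every step
    """
    n, x = len(grads), x0.astype(float).copy()
    fixed_perm = rng.permutation(n)  # used only by SO
    for _ in range(n_epochs):
        if scheme == 'RR':
            order = rng.permutation(n)
        elif scheme == 'SO':
            order = fixed_perm
        elif scheme == 'IG':
            order = np.arange(n)
        else:  # 'SGD': n independent with-replacement samples per "epoch"
            order = rng.integers(0, n, size=n)
        for i in order:
            x -= lr * grads[i](x)
    return x

# Toy strongly convex finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((10, 3)), rng.standard_normal(10)
grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(10)]
x_rr = run_epochs(grads, np.zeros(3), lr=0.05, n_epochs=200, scheme='RR', rng=rng)
```
Each permutation-based scheme visits every data point exactly once per epoch, in contrast to SGD's with-replacement sampling.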
Related papers
- On the Last-Iterate Convergence of Shuffling Gradient Methods [21.865728815935665]
We prove the first last-iterate convergence rates for shuffling gradient methods with respect to the objective value.
Our results either (nearly) match the existing last-iterate lower bounds or are as fast as the previous best upper bounds for the average iterate.
arXiv Detail & Related papers (2024-03-12T15:01:17Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound on the cost function or an upper bound on the expected length of the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization [6.314057999212246]
Random reshuffling techniques are used in large-scale applications, such as neural networks.
In this paper, we show that the random reshuffling-type iterates generated by norm-PRR converge to stationary points.
Finally, we derive last-iterate convergence rates for the proposed approach.
arXiv Detail & Related papers (2023-12-02T07:12:00Z) - Gradient-free optimization of highly smooth functions: improved analysis
and a new algorithm [87.22224691317766]
This work studies problems with zero-order noisy oracle information under the assumption that the objective function is highly smooth.
We consider two kinds of zero-order projected gradient descent algorithms.
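For context on what a zero-order projected gradient method looks like, here is a minimal sketch using a standard two-point sphere-smoothing gradient estimate and projection onto a Euclidean ball; the estimator and the constraint set are generic stand-ins, not necessarily the exact algorithms analyzed in the cited paper:
```python
import numpy as np

def zo_projected_gd(f, x0, radius, lr, h, n_iters, rng):
    """Zero-order projected gradient descent on the ball ||x|| <= radius.

    Gradient estimate: d * (f(x + h*u) - f(x - h*u)) / (2*h) * u,
    with u drawn uniformly from the unit sphere (two-point smoothing).
    """
    x, d = x0.astype(float).copy(), x0.size
    for _ in range(n_iters):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g = d * (f(x + h * u) - f(x - h * u)) / (2 * h) * u
        x -= lr * g
        norm = np.linalg.norm(x)
        if norm > radius:            # Euclidean projection back onto the ball
            x *= radius / norm
    return x

rng = np.random.default_rng(1)
f = lambda x: np.sum((x - 0.3) ** 2)   # smooth convex test function
x_hat = zo_projected_gd(f, np.ones(5), radius=1.0, lr=0.05, h=1e-4, n_iters=2000, rng=rng)
```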
arXiv Detail & Related papers (2023-06-03T17:05:13Z) - Hessian Averaging in Stochastic Newton Methods Achieves Superlinear
Convergence [69.65563161962245]
We consider minimizing a smooth and strongly convex objective function using a stochastic Newton method.
We show that there exists a universal weighted averaging scheme that transitions to local superlinear convergence at an optimal stage.
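A rough sketch of the Hessian-averaging idea, with uniform weights as a simple stand-in for the paper's universal weighted scheme (the oracle interface and toy problem are assumptions for illustration):
```python
import numpy as np

def averaged_newton(grad, hess_estimate, x0, n_iters, rng):
    """Stochastic Newton steps driven by a running average of noisy Hessian
    estimates (uniform weights, as a simple stand-in for weighted schemes)."""
    x = x0.astype(float).copy()
    H_bar = np.zeros((x0.size, x0.size))
    for k in range(n_iters):
        H_k = hess_estimate(x, rng)               # noisy Hessian estimate
        H_bar = (k * H_bar + H_k) / (k + 1)       # running average of Hessians
        x -= np.linalg.solve(H_bar, grad(x))      # Newton step with averaged Hessian
    return x

# Toy usage: quadratic objective with an artificially noisy Hessian oracle
rng = np.random.default_rng(2)
A, b = np.diag(np.arange(1.0, 6.0)), np.ones(5)
grad = lambda x: A @ x - b
hess_estimate = lambda x, rng: A + 0.1 * np.diag(rng.standard_normal(5))
x_hat = averaged_newton(grad, hess_estimate, np.zeros(5), n_iters=50, rng=rng)
```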
arXiv Detail & Related papers (2022-04-20T07:14:21Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
Stochastic gradient Langevin Dynamics is one of the most fundamental algorithms for solving nonconvex optimization problems.
In this paper, we study two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
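As a generic illustration of variance-reduced Langevin dynamics (SVRG-style control variates plus injected Gaussian noise), with details such as the snapshot schedule and index sampling chosen for simplicity rather than taken from the cited paper:
```python
import numpy as np

def svrg_langevin(grads, x0, lr, inv_temp, n_epochs, rng):
    """Variance-reduced Langevin dynamics sketch: SVRG-style control variates
    on the stochastic gradient plus injected Gaussian noise at every step.

    grads : list of per-sample gradient functions g_i(x)
    """
    n, x, d = len(grads), x0.astype(float).copy(), x0.size
    for _ in range(n_epochs):
        w = x.copy()
        full_grad = sum(g(w) for g in grads) / n          # snapshot gradient
        for _ in range(n):
            i = rng.integers(0, n)                        # with-replacement sampling
            g = grads[i](x) - grads[i](w) + full_grad     # control variate
            noise = np.sqrt(2 * lr / inv_temp) * rng.standard_normal(d)
            x += -lr * g + noise
    return x
```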
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Distributed Random Reshuffling over Networks [7.013052033764372]
A distributed random reshuffling (D-RR) algorithm is proposed to solve distributed optimization problems with smooth and convex objective functions.
In particular, for smooth convex objective functions, D-RR achieves a sublinear convergence rate in $T$ (where $T$ counts the epoch number) in terms of the distance between the iterates and the global minimizer.
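One common template for decentralized random reshuffling combines local RR steps with a consensus (mixing) round over the network; the sketch below follows that template and is not necessarily the exact D-RR update:
```python
import numpy as np

def decentralized_rr(grads_per_agent, W, x0, lr, n_epochs, rng):
    """Sketch of decentralized random reshuffling: each agent takes a local
    gradient step on its next shuffled sample, then mixes with its neighbours.

    grads_per_agent[a] : list of per-sample gradient functions held by agent a
    W                  : doubly stochastic mixing matrix of the network
    """
    n_agents, n_local = len(grads_per_agent), len(grads_per_agent[0])
    X = np.tile(x0.astype(float), (n_agents, 1))      # one iterate per agent (rows)
    for _ in range(n_epochs):
        perms = [rng.permutation(n_local) for _ in range(n_agents)]
        for t in range(n_local):
            G = np.stack([grads_per_agent[a][perms[a][t]](X[a])
                          for a in range(n_agents)])
            X = W @ (X - lr * G)                      # local step + consensus mixing
    return X.mean(axis=0)
```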
arXiv Detail & Related papers (2021-12-31T03:59:37Z) - Random Reshuffling with Variance Reduction: New Analysis and Better
Rates [0.8594140167290099]
We provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems.
We show that RR-SVRG converges linearly with an $\mathcal{O}(\kappa^{3/2})$ rate in the strongly convex case.
This improves upon the previous best known rates for variance-reduced methods under random reshuffling.
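A generic sketch of variance reduction on top of random reshuffling, in the spirit of SVRG: one full snapshot gradient per epoch plus control variates along a fresh permutation (a textbook-style template, not necessarily the exact RR-SVRG variant analyzed in the paper):
```python
import numpy as np

def rr_svrg(grads, x0, lr, n_epochs, rng):
    """SVRG-style variance reduction on top of random reshuffling.

    grads : list of per-sample gradient functions g_i(x)
    """
    n, x = len(grads), x0.astype(float).copy()
    for _ in range(n_epochs):
        w = x.copy()
        full_grad = sum(g(w) for g in grads) / n      # snapshot gradient at w
        for i in rng.permutation(n):                  # fresh shuffle each epoch
            x -= lr * (grads[i](x) - grads[i](w) + full_grad)
    return x
```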
arXiv Detail & Related papers (2021-04-19T14:30:10Z) - Proximal and Federated Random Reshuffling [11.83842808044211]
We propose two new algorithms for Random Reshuffling.
ProxRR and FedRR solve composite convex finite-sum minimization problems.
ProxRR is faster than algorithms that evaluate the proximal operator in every iteration.
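The claim about avoiding a proximal evaluation at every iteration suggests the following structure: run plain RR steps on the smooth losses and apply the regularizer's prox once per epoch with an epoch-scaled stepsize. A hedged sketch of that structure, with an $\ell_1$ regularizer as a stand-in:
```python
import numpy as np

def soft_threshold(x, tau):
    """Prox of tau * ||x||_1, used here as an example regularizer."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_rr(grads, x0, lr, lam, n_epochs, rng):
    """Proximal random reshuffling sketch: plain RR steps on the smooth losses,
    then a single prox of the regularizer per epoch with an epoch-scaled stepsize."""
    n, x = len(grads), x0.astype(float).copy()
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            x -= lr * grads[i](x)                 # RR steps on the smooth part
        x = soft_threshold(x, lr * n * lam)       # one prox per epoch
    return x
```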
arXiv Detail & Related papers (2021-02-12T18:59:24Z) - Gradient Free Minimax Optimization: Variance Reduction and Faster
Convergence [120.9336529957224]
In this paper, we study the nonconvex-strongly-concave setting of gradient-free minimax optimization problems.
We show that a novel zeroth-order variance reduced descent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z) - Exploiting Higher Order Smoothness in Derivative-free Optimization and
Continuous Bandits [99.70167985955352]
We study the problem of zero-order optimization of a strongly convex function.
We consider a randomized approximation of the projected gradient descent algorithm.
Our results imply that the zero-order algorithm is nearly optimal in terms of sample complexity and the problem parameters.
arXiv Detail & Related papers (2020-06-14T10:42:23Z)