Random Reshuffling with Variance Reduction: New Analysis and Better
Rates
- URL: http://arxiv.org/abs/2104.09342v1
- Date: Mon, 19 Apr 2021 14:30:10 GMT
- Title: Random Reshuffling with Variance Reduction: New Analysis and Better
Rates
- Authors: Grigory Malinovsky, Alibek Sailanbayev, Peter Richt\'arik
- Abstract summary: We provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems.
We show that RR-SVRG converges linearly with rate $\mathcal{O}(\kappa^{3/2})$ in the strongly-convex case, which improves to $\mathcal{O}(\kappa)$ in the big data regime.
This improves upon the previous best rate $\mathcal{O}(\kappa^2)$ known for a variance-reduced RR method.
- Score: 0.8594140167290099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtually all state-of-the-art methods for training supervised machine
learning models are variants of SGD enhanced with a number of additional
tricks, such as minibatching, momentum, and adaptive stepsizes. One of the
tricks that works so well in practice that it is used as default in virtually
all widely used machine learning software is {\em random reshuffling (RR)}.
However, the practical benefits of RR have until very recently been eluding
attempts at being satisfactorily explained using theory. Motivated by recent
development due to Mishchenko, Khaled and Richt\'{a}rik (2020), in this work we
provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for
general finite-sum problems. First, we show that RR-SVRG converges linearly
with the rate $\mathcal{O}(\kappa^{3/2})$ in the strongly-convex case, and can
be improved further to $\mathcal{O}(\kappa)$ in the big data regime (when $n >
\mathcal{O}(\kappa)$), where $\kappa$ is the condition number. This improves
upon the previous best rate $\mathcal{O}(\kappa^2)$ known for a variance
reduced RR method in the strongly-convex case due to Ying, Yuan and Sayed
(2020). Second, we obtain the first sublinear rate for general convex problems.
Third, we establish similar fast rates for Cyclic-SVRG and Shuffle-Once-SVRG.
Finally, we develop and analyze a more general variance reduction scheme for
RR, which allows for less frequent updates of the control variate. We
corroborate our theoretical results with suitably chosen experiments on
synthetic and real datasets.
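To make the setting concrete, here is a minimal NumPy sketch of how SVRG-style variance reduction can be combined with random reshuffling on a finite-sum least-squares problem: the data are reshuffled at the start of every epoch, each inner step uses the control variate $\nabla f_i(x) - \nabla f_i(y) + \nabla f(y)$, and the reference point $y$ is refreshed once per epoch. The quadratic objective, the step size, and the once-per-epoch refresh schedule are illustrative assumptions, not the exact RR-SVRG algorithm analyzed in the paper.

```python
import numpy as np

def rr_svrg_sketch(A, b, stepsize, epochs, rng):
    """Illustrative SVRG + Random Reshuffling for
        f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
    The once-per-epoch refresh of the reference point is an assumption
    made for illustration; it is not claimed to match the paper exactly."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        # Refresh the control variate: snapshot point and its full gradient.
        y = x.copy()
        full_grad = A.T @ (A @ y - b) / n
        # Random Reshuffling: draw a fresh permutation every epoch.
        for i in rng.permutation(n):
            g_x = A[i] * (A[i] @ x - b[i])  # gradient of the i-th term at x
            g_y = A[i] * (A[i] @ y - b[i])  # gradient of the i-th term at y
            x -= stepsize * (g_x - g_y + full_grad)  # variance-reduced step
    return x

# Toy usage on noiseless synthetic data, so the true minimizer is known.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
step = 0.5 / np.max(np.sum(A**2, axis=1))   # conservative step size
x_hat = rr_svrg_sketch(A, b, stepsize=step, epochs=100, rng=rng)
print("distance to minimizer:", np.linalg.norm(x_hat - x_star))
```

The paper's more general scheme additionally allows the control variate to be refreshed less often than once per epoch; the sketch above fixes that schedule only to keep the example short.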
Related papers
- Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until they nearly interpolate the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, random feature ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization [6.314057999212246]
Random reshuffling techniques are used in large-scale applications, such as training neural networks.
In this paper, we show that the random reshuffling-type iterations generated by norm-PRR converge to stationary points.
Finally, we derive last-iterate convergence rates for the proposed approach.
arXiv Detail & Related papers (2023-12-02T07:12:00Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - DASHA: Distributed Nonconvex Optimization with Communication
Compression, Optimal Oracle Complexity, and No Client Synchronization [77.34726150561087]
We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.
Unlike MARINA, the new methods DASHA and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for learning.
arXiv Detail & Related papers (2022-02-02T20:10:40Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Robust Kernel-based Distribution Regression [13.426195476348955]
We study distribution regression (DR), which involves two stages of sampling and aims at regressing from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS).
By introducing a robust loss function $l_\sigma$ for two-stage sampling problems, we present a novel robust distribution regression (RDR) scheme.
arXiv Detail & Related papers (2021-04-21T17:03:46Z) - Proximal and Federated Random Reshuffling [11.83842808044211]
We propose two new algorithms for Random Reshuffling.
ProxRR and FedRR solve composite convex finite-sum minimization problems.
ProxRR is faster than algorithms that evaluate the proximal operator in every iteration (see the sketch after this list).
arXiv Detail & Related papers (2021-02-12T18:59:24Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa\dots$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Random Reshuffling: Simple Analysis with Vast Improvements [9.169947558498535]
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative descent steps in conjunction with data reshuffling.
arXiv Detail & Related papers (2020-06-10T17:57:21Z)
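The Proximal and Federated Random Reshuffling entry above notes that ProxRR is faster than methods that evaluate the proximal operator at every iteration; the structural idea this suggests is applying the proximal step once per epoch, after a full reshuffled pass over the smooth terms. The sketch below renders that pattern for $\ell_1$-regularized least squares with a soft-thresholding prox; the objective, the prox scaling, and the helper names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_rr_sketch(A, b, lam, stepsize, epochs, rng):
    """Illustrative ProxRR-style loop for
        min_x (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + lam * ||x||_1.
    One reshuffled pass over the smooth terms per epoch, then a single
    proximal step scaled by the epoch length; details are assumptions."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):            # Random Reshuffling pass
            x -= stepsize * A[i] * (A[i] @ x - b[i])
        # Proximal step applied once per epoch instead of at every iteration.
        x = soft_threshold(x, lam * n * stepsize)
    return x

# Toy usage: a sparse ground truth recovered from noiseless measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 20))
x_star = np.zeros(20)
x_star[:3] = [2.0, -1.0, 0.5]
b = A @ x_star
step = 0.5 / np.max(np.sum(A**2, axis=1))
x_hat = prox_rr_sketch(A, b, lam=0.01, stepsize=step, epochs=200, rng=rng)
print("recovered support:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
```

Performing the proximal step once per epoch rather than $n$ times per epoch is where the speedup claimed in that entry would come from, and the benefit is largest when the proximal operator itself is expensive to evaluate.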