Random Reshuffling with Variance Reduction: New Analysis and Better
Rates
- URL: http://arxiv.org/abs/2104.09342v1
- Date: Mon, 19 Apr 2021 14:30:10 GMT
- Title: Random Reshuffling with Variance Reduction: New Analysis and Better
Rates
- Authors: Grigory Malinovsky, Alibek Sailanbayev, Peter Richt\'arik
- Abstract summary: We provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for general finite-sum problems.
We show that RR-SVRG converges linearly with rate $\mathcal{O}(\kappa^{3/2})$ in the strongly-convex case, which improves to $\mathcal{O}(\kappa)$ in the big data regime.
This improves upon the previous best rate $\mathcal{O}(\kappa^2)$ known for a variance-reduced RR method.
- Score: 0.8594140167290099
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtually all state-of-the-art methods for training supervised machine
learning models are variants of SGD enhanced with a number of additional
tricks, such as minibatching, momentum, and adaptive stepsizes. One of the
tricks that works so well in practice that it is used as default in virtually
all widely used machine learning software is {\em random reshuffling (RR)}.
However, the practical benefits of RR have until very recently been eluding
attempts at being satisfactorily explained using theory. Motivated by recent
development due to Mishchenko, Khaled and Richt\'{a}rik (2020), in this work we
provide the first analysis of SVRG under Random Reshuffling (RR-SVRG) for
general finite-sum problems. First, we show that RR-SVRG converges linearly
with the rate $\mathcal{O}(\kappa^{3/2})$ in the strongly-convex case, and can
be improved further to $\mathcal{O}(\kappa)$ in the big data regime (when $n >
\mathcal{O}(\kappa)$), where $\kappa$ is the condition number. This improves
upon the previous best rate $\mathcal{O}(\kappa^2)$ known for a variance
reduced RR method in the strongly-convex case due to Ying, Yuan and Sayed
(2020). Second, we obtain the first sublinear rate for general convex problems.
Third, we establish similar fast rates for Cyclic-SVRG and Shuffle-Once-SVRG.
Finally, we develop and analyze a more general variance reduction scheme for
RR, which allows for less frequent updates of the control variate. We
corroborate our theoretical results with suitably chosen experiments on
synthetic and real datasets.
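To make the setting concrete, here is a minimal NumPy sketch of how SVRG-style variance reduction can be combined with random reshuffling on a finite-sum least-squares problem: the data are reshuffled at the start of every epoch, each inner step uses the control variate $\nabla f_i(x) - \nabla f_i(y) + \nabla f(y)$, and the reference point $y$ is refreshed once per epoch. The quadratic objective, the step size, and the once-per-epoch refresh schedule are illustrative assumptions, not the exact RR-SVRG algorithm analyzed in the paper.

```python
import numpy as np

def rr_svrg_sketch(A, b, stepsize, epochs, rng):
    """Illustrative SVRG + Random Reshuffling for
        f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
    The once-per-epoch refresh of the reference point is an assumption
    made for illustration; it is not claimed to match the paper exactly."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        # Refresh the control variate: snapshot point and its full gradient.
        y = x.copy()
        full_grad = A.T @ (A @ y - b) / n
        # Random Reshuffling: draw a fresh permutation every epoch.
        for i in rng.permutation(n):
            g_x = A[i] * (A[i] @ x - b[i])  # gradient of the i-th term at x
            g_y = A[i] * (A[i] @ y - b[i])  # gradient of the i-th term at y
            x -= stepsize * (g_x - g_y + full_grad)  # variance-reduced step
    return x

# Toy usage on noiseless synthetic data, so the true minimizer is known.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
step = 0.5 / np.max(np.sum(A**2, axis=1))   # conservative step size
x_hat = rr_svrg_sketch(A, b, stepsize=step, epochs=100, rng=rng)
print("distance to minimizer:", np.linalg.norm(x_hat - x_star))
```

The paper's more general scheme additionally allows the control variate to be refreshed less often than once per epoch; the sketch above fixes that schedule only to keep the example short.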
Related papers
- Asymptotics of Random Feature Regression Beyond the Linear Scaling
Regime [22.666759017118796]
Recent advances in machine learning have been achieved by using overparametrized models trained until they nearly interpolate the training data.
How do model complexity and generalization depend on the number of parameters $p$?
In particular, random feature ridge regression (RFRR) exhibits an intuitive trade-off between approximation and generalization power.
arXiv Detail & Related papers (2024-03-13T00:59:25Z) - A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization [6.314057999212246]
Random reshuffling techniques are used in large-scale applications, such as training neural networks.
In this paper, we show that the random reshuffling-type iterations generated by norm-PRR converge to stationary points.
Finally, we derive last-iterate convergence rates for the proposed approach.
arXiv Detail & Related papers (2023-12-02T07:12:00Z) - Settling the Sample Complexity of Online Reinforcement Learning [92.02082223856479]
We show how to achieve minimax-optimal regret without incurring any burn-in cost.
We extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances.
arXiv Detail & Related papers (2023-07-25T15:42:11Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - DASHA: Distributed Nonconvex Optimization with Communication
Compression, Optimal Oracle Complexity, and No Client Synchronization [77.34726150561087]
We develop and analyze DASHA: a new family of methods for nonconvex distributed optimization problems.
Unlike MARINA, the new methods DASHA and DASHA-MVR send compressed vectors only and never synchronize the nodes, which makes them more practical for learning.
arXiv Detail & Related papers (2022-02-02T20:10:40Z) - Towards an Understanding of Benign Overfitting in Neural Networks [104.2956323934544]
Modern machine learning models often employ a huge number of parameters and are typically optimized to have zero training loss.
We examine how these benign overfitting phenomena occur in a two-layer neural network setting.
We show that it is possible for the two-layer ReLU network interpolator to achieve a near minimax-optimal learning rate.
arXiv Detail & Related papers (2021-06-06T19:08:53Z) - Robust Kernel-based Distribution Regression [13.426195476348955]
We study distribution regression (DR), which involves two stages of sampling and aims at regressing from probability measures to real-valued responses over a reproducing kernel Hilbert space (RKHS).
By introducing a robust loss function $l_\sigma$ for two-stage sampling problems, we present a novel robust distribution regression (RDR) scheme.
arXiv Detail & Related papers (2021-04-21T17:03:46Z) - Proximal and Federated Random Reshuffling [11.83842808044211]
We propose two new algorithms for Random Reshuffling.
ProxRR and FedRR solve composite convex finite-sum minimization problems.
ProxRR is faster than algorithms that evaluate the proximal operator in every iteration (see the sketch after this list).
arXiv Detail & Related papers (2021-02-12T18:59:24Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa\dots$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Random Reshuffling: Simple Analysis with Vast Improvements [9.169947558498535]
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative descent steps in conjunction with data reshuffling.
arXiv Detail & Related papers (2020-06-10T17:57:21Z)
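The Proximal and Federated Random Reshuffling entry above notes that ProxRR is faster than methods that evaluate the proximal operator at every iteration; the structural idea this suggests is applying the proximal step once per epoch, after a full reshuffled pass over the smooth terms. The sketch below renders that pattern for $\ell_1$-regularized least squares with a soft-thresholding prox; the objective, the prox scaling, and the helper names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_rr_sketch(A, b, lam, stepsize, epochs, rng):
    """Illustrative ProxRR-style loop for
        min_x (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + lam * ||x||_1.
    One reshuffled pass over the smooth terms per epoch, then a single
    proximal step scaled by the epoch length; details are assumptions."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):            # Random Reshuffling pass
            x -= stepsize * A[i] * (A[i] @ x - b[i])
        # Proximal step applied once per epoch instead of at every iteration.
        x = soft_threshold(x, lam * n * stepsize)
    return x

# Toy usage: a sparse ground truth recovered from noiseless measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((300, 20))
x_star = np.zeros(20)
x_star[:3] = [2.0, -1.0, 0.5]
b = A @ x_star
step = 0.5 / np.max(np.sum(A**2, axis=1))
x_hat = prox_rr_sketch(A, b, lam=0.01, stepsize=step, epochs=200, rng=rng)
print("recovered support:", np.nonzero(np.abs(x_hat) > 1e-3)[0])
```

Performing the proximal step once per epoch rather than $n$ times per epoch is where the speedup claimed in that entry would come from, and the benefit is largest when the proximal operator itself is expensive to evaluate.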