High Probability Guarantees for Random Reshuffling
- URL: http://arxiv.org/abs/2311.11841v2
- Date: Fri, 8 Dec 2023 02:26:17 GMT
- Title: High Probability Guarantees for Random Reshuffling
- Authors: Hengxu Yu, Xiao Li
- Abstract summary: We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems.
In this work, we first investigate the concentration property of $\mathsf{RR}$'s sampling procedure and establish a new high probability sample complexity guarantee.
Then, we design a perturbed random reshuffling method ($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized perturbation procedure near stationary points.
- Score: 5.663909018247509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the stochastic gradient method with random reshuffling
($\mathsf{RR}$) for tackling smooth nonconvex optimization problems.
$\mathsf{RR}$ finds broad applications in practice, notably in training neural
networks. In this work, we first investigate the concentration property of
$\mathsf{RR}$'s sampling procedure and establish a new high probability sample
complexity guarantee for driving the gradient (without expectation) below
$\varepsilon$, which effectively characterizes the efficiency of a single
$\mathsf{RR}$ execution. Our derived complexity matches the best existing
in-expectation one up to a logarithmic term while imposing no additional
assumptions nor changing $\mathsf{RR}$'s updating rule. Furthermore, by
leveraging our derived high probability descent property and bound on the
stochastic error, we propose a simple and computable stopping criterion for
$\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is
guaranteed to be triggered after a finite number of iterations, and then
$\mathsf{RR}$-$\mathsf{sc}$ returns an iterate with its gradient below
$\varepsilon$ with high probability. Moreover, building on the proposed
stopping criterion, we design a perturbed random reshuffling method
($\mathsf{p}$-$\mathsf{RR}$) that involves an additional randomized
perturbation procedure near stationary points. We derive that
$\mathsf{p}$-$\mathsf{RR}$ provably escapes strict saddle points and
efficiently returns a second-order stationary point with high probability,
without making any sub-Gaussian tail-type assumptions on the stochastic
gradient errors. Finally, we conduct numerical experiments on neural network
training to support our theoretical findings.
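As a rough illustration of the method described above, the following minimal Python/NumPy sketch runs random reshuffling on a toy finite-sum least-squares problem, checks an end-of-epoch stopping condition, and applies a small random perturbation near stationary points in the spirit of $\mathsf{RR}$-$\mathsf{sc}$ and $\mathsf{p}$-$\mathsf{RR}$. It is not the paper's exact algorithm: the objective, step size, thresholds, perturbation radius, and the use of a full gradient in the stopping check are illustrative simplifications (the paper's criterion is computable from the $\mathsf{RR}$ updates themselves).

```python
# Minimal sketch (illustrative, not the paper's exact algorithm or constants):
# random reshuffling (RR) on f(x) = (1/n) * sum_i f_i(x) with a stopping check
# and a p-RR style perturbation near stationary points.
import numpy as np

rng = np.random.default_rng(0)

# Toy components f_i(x) = 0.5 * (a_i^T x - b_i)^2.
n, d = 64, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)


def grad_i(x, i):
    """Gradient of the i-th component f_i at x."""
    return (A[i] @ x - b[i]) * A[i]


def full_grad(x):
    """Full gradient of f at x (used here only for a simple stopping check)."""
    return A.T @ (A @ x - b) / n


def perturbed_rr(x, step=1e-3, eps=1e-2, radius=1e-3, max_epochs=1000):
    """RR with an end-of-epoch stopping check and a p-RR style perturbation."""
    for epoch in range(max_epochs):
        for i in rng.permutation(n):       # reshuffle once, then one full pass
            x = x - step * grad_i(x, i)
        if np.linalg.norm(full_grad(x)) <= eps:
            # Near a stationary point: add a small random perturbation; if the
            # gradient stays small, accept the iterate (at a strict saddle the
            # perturbation would instead help subsequent RR epochs escape).
            x_pert = x + radius * rng.standard_normal(d) / np.sqrt(d)
            if np.linalg.norm(full_grad(x_pert)) <= eps:
                return x, epoch + 1
            x = x_pert                     # keep running RR from the perturbed point
    return x, max_epochs


x_out, epochs = perturbed_rr(np.zeros(d))
print(epochs, np.linalg.norm(full_grad(x_out)))
```

On this convex toy problem the perturbation never has to escape anything; it is included only to show where the $\mathsf{p}$-$\mathsf{RR}$ step would occur once the stopping check is triggered.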
Related papers
- Near Optimal Heteroscedastic Regression with Symbiotic Learning [29.16456701187538]
We consider the problem of heteroscedastic linear regression.
We can estimate $\mathbf{w}^{*}$ in squared norm up to an error of $\tilde{O}\left(\|\mathbf{f}^{*}\|^{2}\cdot\left(\frac{1}{n}+\left(\frac{d}{n}\right)^{2}\right)\right)$ and prove a matching lower bound.
arXiv Detail & Related papers (2023-06-25T16:32:00Z)
- Convergence of Alternating Gradient Descent for Matrix Factorization [5.439020425819001]
We consider alternating gradient descent (AGD) with fixed step size applied to the asymmetric matrix factorization objective.
We show that, for a rank-$r$ matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the constant $C$ in the iteration complexity $T$ can be taken to be absolute (a toy sketch of the alternating update appears after this list).
arXiv Detail & Related papers (2023-05-11T16:07:47Z)
- Near-Optimal Cryptographic Hardness of Agnostically Learning Halfspaces and ReLU Regression under Gaussian Marginals [43.0867217287089]
We study the task of agnostically learning halfspaces under the Gaussian distribution.
We prove a near-optimal computational hardness result for this task.
arXiv Detail & Related papers (2023-02-13T16:46:23Z)
- Learning a Single Neuron with Adversarial Label Noise via Gradient Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C \cdot \mathrm{OPT} + \epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z)
- On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
arXiv Detail & Related papers (2021-12-03T21:42:33Z)
- Randomized Exploration for Reinforcement Learning with General Value Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^{*} \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^{\top}]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is ...
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
- Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions [84.49087114959872]
We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions.
In particular, we study Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions.
arXiv Detail & Related papers (2020-02-10T23:23:04Z)
- Stochastic Recursive Gradient Descent Ascent for Stochastic Nonconvex-Strongly-Concave Minimax Problems [36.645753881826955]
In this paper, we propose a novel method called Stochastic Recursive gradiEnt Descent Ascent (SREDA), which estimates gradients more efficiently using variance reduction.
This method achieves the best known complexity for this problem.
arXiv Detail & Related papers (2020-01-11T09:05:03Z)
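For the "Convergence of Alternating Gradient Descent for Matrix Factorization" entry above, here is a minimal NumPy sketch of alternating gradient descent with a fixed step size on the objective $\frac{1}{2}\|XY^{\top} - A\|_F^2$. The dimensions, rank, initialization scale, step size, and iteration count are illustrative placeholders, not the paper's analyzed setting or step-size rule.

```python
# Minimal sketch (illustrative only): alternating gradient descent with a
# fixed step size on f(X, Y) = 0.5 * ||X Y^T - A||_F^2.
import numpy as np

rng = np.random.default_rng(1)

m, n, r = 40, 30, 3
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target

# Small random initialization of the two factors.
X = 0.01 * rng.standard_normal((m, r))
Y = 0.01 * rng.standard_normal((n, r))

step = 5e-3
for _ in range(3000):
    R = X @ Y.T - A            # residual with Y held fixed
    X = X - step * (R @ Y)     # gradient step in X
    R = X @ Y.T - A            # residual recomputed with the updated X
    Y = Y - step * (R.T @ X)   # gradient step in Y

print(np.linalg.norm(X @ Y.T - A) / np.linalg.norm(A))  # relative error
```

The updates are Gauss-Seidel style (the step in $Y$ uses the freshly updated $X$), which is one common reading of "alternating" gradient descent; the paper's precise guarantees depend on its own initialization and step-size choices.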