Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling
- URL: http://arxiv.org/abs/2206.02275v1
- Date: Sun, 5 Jun 2022 21:32:33 GMT
- Title: Sharper Rates and Flexible Framework for Nonconvex SGD with Client and
Data Sampling
- Authors: Alexander Tyurin and Lukang Sun and Konstantin Burlachenko and Peter
Richt\'arik
- Abstract summary: We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly non-color functions.
We generalize the $smallsfcolorgreen$ so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of optimal bound in the smooth non-color regime.
- Score: 64.31011847952006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We revisit the classical problem of finding an approximately stationary point
of the average of $n$ smooth and possibly nonconvex functions. The optimal
complexity of stochastic first-order methods in terms of the number of gradient
evaluations of individual functions is $\mathcal{O}\left(n +
n^{1/2}\varepsilon^{-1}\right)$, attained by the optimal SGD methods
$\small\sf\color{green}{SPIDER}$(arXiv:1807.01695) and
$\small\sf\color{green}{PAGE}$(arXiv:2008.10898), for example, where
$\varepsilon$ is the error tolerance. However, i) the big-$\mathcal{O}$
notation hides crucial dependencies on the smoothness constants associated with
the functions, and ii) the rates and theory in these methods assume simplistic
sampling mechanisms that do not offer any flexibility. In this work we remedy
the situation. First, we generalize the $\small\sf\color{green}{PAGE}$
algorithm so that it can provably work with virtually any (unbiased) sampling
mechanism. This is particularly useful in federated learning, as it allows us
to construct and better understand the impact of various combinations of client
and data sampling strategies. Second, our analysis is sharper as we make
explicit use of certain novel inequalities that capture the intricate interplay
between the smoothness constants and the sampling procedure. Indeed, our
analysis is better even for the simple sampling procedure analyzed in the
$\small\sf\color{green}{PAGE}$ paper. However, this already improved bound can
be further sharpened by a different sampling scheme which we propose. In
summary, we provide the most general and most accurate analysis of optimal SGD
in the smooth nonconvex regime. Finally, our theoretical findings are supposed
with carefully designed experiments.
Related papers
- The Plug-in Approach for Average-Reward and Discounted MDPs: Optimal Sample Complexity Analysis [6.996002801232415]
We study the sample complexity of the plug-in approach for learning $varepsilon$-optimal policies in average-reward Markov decision processes.
Despite representing arguably the simplest algorithm for this problem, the plug-in approach has never been theoretically analyzed.
arXiv Detail & Related papers (2024-10-10T05:08:14Z) - Empirical Risk Minimization with Shuffled SGD: A Primal-Dual Perspective
and Improved Bounds [12.699376765058137]
gradient descent (SGD) is perhaps the most prevalent optimization method in modern machine learning.
It is only very recently that SGD with sampling without replacement -- shuffled SGD -- has been analyzed.
We prove fine-grained complexity bounds that depend on the data matrix and are never worse than what is predicted by the existing bounds.
arXiv Detail & Related papers (2023-06-21T18:14:44Z) - Achieving the Asymptotically Optimal Sample Complexity of Offline Reinforcement Learning: A DRO-Based Approach [36.88301225561535]
offline reinforcement learning aims to learn from pre-collected datasets without active exploration.
Existing approaches adopt a pessimistic stance towards uncertainty by penalizing rewards of under-explored state-action pairs to estimate value functions conservatively.
We show that the distributionally robust optimization (DRO) based approach can also address these challenges and is asymptotically minimax optimal
arXiv Detail & Related papers (2023-05-22T17:50:18Z) - Stochastic Inexact Augmented Lagrangian Method for Nonconvex Expectation
Constrained Optimization [88.0031283949404]
Many real-world problems have complicated non functional constraints and use a large number of data points.
Our proposed method outperforms an existing method with the previously best-known result.
arXiv Detail & Related papers (2022-12-19T14:48:54Z) - Multi-block-Single-probe Variance Reduced Estimator for Coupled
Compositional Optimization [49.58290066287418]
We propose a novel method named Multi-block-probe Variance Reduced (MSVR) to alleviate the complexity of compositional problems.
Our results improve upon prior ones in several aspects, including the order of sample complexities and dependence on strongity.
arXiv Detail & Related papers (2022-07-18T12:03:26Z) - Approximate Function Evaluation via Multi-Armed Bandits [51.146684847667125]
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $boldsymbolmu in mathbbRn$, where each component $mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-delta$ returns an $epsilon$ accurate estimate of $f(boldsymbolmu)$.
arXiv Detail & Related papers (2022-03-18T18:50:52Z) - Stochastic Bias-Reduced Gradient Methods [44.35885731095432]
We develop a new primitive for optimization: a low-bias, low-cost smoothing of ther $x_star$ of any bound of the Moreau-Yoshida function.
arXiv Detail & Related papers (2021-06-17T13:33:05Z) - Least Squares Regression with Markovian Data: Fundamental Limits and
Algorithms [69.45237691598774]
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information theoretic minimax lower bounds for this problem in terms of $tau_mathsfmix$.
We propose an algorithm based on experience replay--a popular reinforcement learning technique--that achieves a significantly better error rate.
arXiv Detail & Related papers (2020-06-16T04:26:50Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP)
We show that the number of samples needed to yield an entrywise $varepsilon$-accurate estimate of the Q-function is at most on the order of $frac1mu_min (1-gamma)5varepsilon2+ fract_mixmu_min (1-gamma)$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Better Theory for SGD in the Nonconvex World [2.6397379133308214]
Large-scale non optimization problems are ubiquitous in modern machine learning.
We perform experiments on the effects of a wide array of synthetic minibatch sizes on the Gradient Descent (SG) problem.
arXiv Detail & Related papers (2020-02-09T09:56:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.