A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm
- URL: http://arxiv.org/abs/2202.09688v1
- Date: Sat, 19 Feb 2022 22:12:30 GMT
- Title: A Variance-Reduced Stochastic Accelerated Primal Dual Algorithm
- Authors: Bugra Can, Mert Gurbuzbalaban, Necdet Serhat Aybat
- Abstract summary: Such problems arise frequently in machine learning in the context of robust empirical risk minimization.
We consider the stochastic accelerated primal dual (SAPD) algorithm as a robust method against gradient noise.
We show that our method improves upon SAPD both in practice and in theory.
- Score: 3.2958527541557525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we consider strongly convex strongly concave (SCSC) saddle
point (SP) problems
$\min_{x\in\mathbb{R}^{d_x}}\max_{y\in\mathbb{R}^{d_y}}f(x,y)$ where $f$ is
$L$-smooth, $f(.,y)$ is $\mu$-strongly convex for every $y$, and $f(x,.)$ is
$\mu$-strongly concave for every $x$. Such problems arise frequently in machine
learning in the context of robust empirical risk minimization (ERM), e.g.
$\textit{distributionally robust}$ ERM, where partial gradients are estimated
using mini-batches of data points. Assuming we have access to an unbiased
stochastic first-order oracle we consider the stochastic accelerated primal
dual (SAPD) algorithm recently introduced in Zhang et al. [2021] for SCSC SP
problems as a robust method against gradient noise. In particular, SAPD
recovers the well-known stochastic gradient descent ascent (SGDA) as a special
case when the momentum parameter is set to zero and can achieve an accelerated
rate when the momentum parameter is properly tuned, i.e., improving the $\kappa
\triangleq L/\mu$ dependence from $\kappa^2$ for SGDA to $\kappa$. We propose
efficient variance-reduction strategies for SAPD based on Richardson-Romberg
extrapolation and show that our method improves upon SAPD both in practice and
in theory.
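The abstract describes the two algorithmic ingredients only at a high level, so the following is a minimal, hypothetical Python sketch, not the authors' implementation: it runs SGDA (the zero-momentum special case of SAPD) on a toy SCSC problem and then applies one common form of Richardson-Romberg extrapolation, combining the averaged iterates of two runs with step sizes $\tau$ and $\tau/2$ so that the leading step-size-induced bias cancels. The toy objective, the additive-noise oracle, the step sizes, and all function names are illustrative assumptions; the paper applies the extrapolation to SAPD itself, and its parameter pairing may differ.

```python
# Hypothetical sketch (not the authors' code): SGDA, i.e., SAPD with zero momentum,
# on the toy SCSC saddle point f(x, y) = (mu/2)||x||^2 + x^T A y - (mu/2)||y||^2,
# followed by Richardson-Romberg extrapolation of the averaged iterates.
import numpy as np

def sgda_averaged(A, mu, tau, n_iters, noise_std, rng):
    """Constant-step SGDA with an unbiased (additive Gaussian) gradient oracle;
    returns the running averages of the primal and dual iterates."""
    d_x, d_y = A.shape
    x, y = np.zeros(d_x), np.zeros(d_y)
    x_avg, y_avg = np.zeros(d_x), np.zeros(d_y)
    for t in range(n_iters):
        g_x = mu * x + A @ y + noise_std * rng.standard_normal(d_x)    # noisy d/dx f
        g_y = A.T @ x - mu * y + noise_std * rng.standard_normal(d_y)  # noisy d/dy f
        x, y = x - tau * g_x, y + tau * g_y                            # descent / ascent step
        x_avg += (x - x_avg) / (t + 1)                                 # running averages
        y_avg += (y - y_avg) / (t + 1)
    return x_avg, y_avg

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
mu, tau, T, sigma = 1.0, 0.05, 50_000, 1.0

x_tau, y_tau = sgda_averaged(A, mu, tau, T, sigma, rng)        # run with step size tau
x_half, y_half = sgda_averaged(A, mu, tau / 2, T, sigma, rng)  # run with step size tau / 2

# Richardson-Romberg extrapolation: 2 * z_{tau/2} - z_{tau} is designed to cancel
# the leading step-size-induced bias of the averaged iterates.
x_rr, y_rr = 2 * x_half - x_tau, 2 * y_half - y_tau

# The saddle point of this toy problem is (0, 0), so the norms measure the error.
print("||x_tau|| =", np.linalg.norm(x_tau), " ||x_rr|| =", np.linalg.norm(x_rr))
```

Setting the momentum parameter to zero is what reduces SAPD to SGDA here; with a properly tuned momentum the primal-dual updates change, but the extrapolation step that combines two runs is applied in the same way.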
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients [0.196629787330046]
We show that incorporating partial second-order information of the objective function can dramatically improve the robustness of variance-reduced gradient methods to the mini-batch size.
We demonstrate this phenomenon on a prototypical Newton ($\texttt{Mb-SVRN}$) algorithm.
arXiv Detail & Related papers (2024-04-23T05:45:52Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum
Minimization [52.25843977506935]
We propose an adaptive variance-reduction method, called AdaSpider, for $L$-smooth, non-convex functions with a finite-sum structure.
In doing so, we are able to compute an $\epsilon$-stationary point with $\tilde{O}\left(n + \sqrt{n}/\epsilon^2\right)$ oracle calls.
arXiv Detail & Related papers (2022-11-03T14:41:46Z) - On Acceleration of Gradient-Based Empirical Risk Minimization using
Local Polynomial Regression [0.491574468325115]
We study the acceleration of the Local Polynomial Interpolation-based Gradient Descent method (LPI-GD) recently proposed for the approximate solution of empirical risk minimization (ERM) problems.
We focus on loss functions that are strongly convex and smooth with condition number $\sigma$.
We propose two accelerated methods for the problem based on LPI-GD and show an oracle complexity of $\tilde{O}\left(\sqrt{\sigma}\, m d \log(1/\varepsilon)\right)$.
arXiv Detail & Related papers (2022-04-16T02:39:45Z) - Towards Noise-adaptive, Problem-adaptive Stochastic Gradient Descent [7.176107039687231]
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise in the stochastic gradients and (ii) problem-dependent parameters.
We prove that $T$ iterations of SGD with Nesterov acceleration can be near-optimal.
Compared to other step-size schemes, we demonstrate the effectiveness of a novel exponential step-size scheme.
arXiv Detail & Related papers (2021-10-21T19:22:14Z) - What Happens after SGD Reaches Zero Loss? --A Mathematical Framework [35.31946061894308]
Understanding the implicit bias of Stochastic Gradient Descent (SGD) is one of the key challenges in deep learning.
This paper gives a general framework for such analysis by adapting ideas from Katzenberger (1991).
It yields some new results: (1) a global analysis of the implicit bias valid for $\eta^{-2}$ steps, in contrast to the local analysis of Blanc et al. (2020) that is only valid for $\eta^{-1.6}$ steps, and (2) allowing arbitrary noise covariance.
arXiv Detail & Related papers (2021-10-13T17:50:46Z) - Asynchronous Stochastic Optimization Robust to Arbitrary Delays [54.61797739710608]
We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for an arbitrary delay $d_t$ (a minimal sketch of this update rule appears after this list).
Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
arXiv Detail & Related papers (2021-06-22T15:50:45Z) - Non-Euclidean Differentially Private Stochastic Convex Optimization [15.302167005107135]
We show that noisy stochastic gradient descent (SGD) algorithms attain the optimal excess risk in low-dimensional regimes.
Our work draws upon concepts from the geometry of normed spaces, such as the notions of regularity, uniform convexity, and uniform smoothness.
arXiv Detail & Related papers (2021-03-01T19:48:44Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian.
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$.
arXiv Detail & Related papers (2020-06-19T14:11:26Z)
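As referenced in the "Asynchronous Stochastic Optimization Robust to Arbitrary Delays" entry above, here is a minimal, hypothetical sketch of the delayed-gradient update model it describes: at step $t$ the algorithm applies a stochastic gradient evaluated at the stale iterate from step $t - d_t$. This illustrates the update model only, not that paper's algorithm or its robustness guarantees; the quadratic objective, the delay distribution, and all names are assumptions.

```python
# Hypothetical sketch of SGD with arbitrarily delayed gradients: the update at
# step t uses a noisy gradient evaluated at the iterate from step t - d_t.
import numpy as np

def delayed_sgd(grad, x0, eta, delays, noise_std, rng):
    """Run SGD where step t applies a noisy gradient from iterate t - delays[t]."""
    history = [np.array(x0, dtype=float)]
    x = history[0].copy()
    for t, d_t in enumerate(delays):
        stale_idx = max(0, t - int(d_t))          # index of the stale iterate
        g = grad(history[stale_idx]) + noise_std * rng.standard_normal(x.shape)
        x = x - eta * g                           # update with the stale gradient
        history.append(x.copy())
    return x

rng = np.random.default_rng(1)
grad = lambda x: x                                # gradient of f(x) = ||x||^2 / 2
delays = rng.integers(0, 20, size=10_000)         # arbitrary (here uniform) delays
x_T = delayed_sgd(grad, np.ones(3), eta=0.01, delays=delays, noise_std=0.1, rng=rng)
print("||x_T|| =", np.linalg.norm(x_T))
```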