ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method
- URL: http://arxiv.org/abs/2103.11333v1
- Date: Sun, 21 Mar 2021 08:14:40 GMT
- Title: ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method
- Authors: Zhize Li
- Abstract summary: We propose a novel accelerated variance-reduced method called ANITA for finite-sum optimization.
In the general convex setting, ANITA achieves the convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}}, \log\sqrt{n}\big\} + \sqrt{\frac{nL}{\epsilon}}\big)$.
In the strongly convex setting, we also show that ANITA can achieve the optimal convergence result $O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$ matching the lower bound $\Omega\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$.
- Score: 17.259824817932294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel accelerated variance-reduced gradient method called ANITA
for finite-sum optimization. In this paper, we consider both general convex and
strongly convex settings. In the general convex setting, ANITA achieves the
convergence result $O\big(n\min\big\{1+\log\frac{1}{\epsilon\sqrt{n}},
\log\sqrt{n}\big\} + \sqrt{\frac{nL}{\epsilon}} \big)$, which improves the
previous best result $O\big(n\min\{\log\frac{1}{\epsilon}, \log
n\}+\sqrt{\frac{nL}{\epsilon}}\big)$ given by Varag (Lan et al., 2019). In
particular, for a very wide range of $\epsilon$, i.e., $\epsilon \in
(0,\frac{L}{n\log^2\sqrt{n}}]\cup [\frac{1}{\sqrt{n}},+\infty)$, where
$\epsilon$ is the error tolerance $f(x_T)-f^*\leq \epsilon$ and $n$ is the
number of data samples, ANITA can achieve the optimal convergence result
$O\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ matching the lower bound
$\Omega\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ provided by Woodworth and Srebro
(2016). To the best of our knowledge, ANITA is the \emph{first} accelerated
algorithm which can \emph{exactly} achieve this optimal result
$O\big(n+\sqrt{\frac{nL}{\epsilon}}\big)$ for general convex finite-sum
problems. In the strongly convex setting, we also show that ANITA can achieve
the optimal convergence result
$O\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$ matching
the lower bound
$\Omega\Big(\big(n+\sqrt{\frac{nL}{\mu}}\big)\log\frac{1}{\epsilon}\Big)$
provided by Lan and Zhou (2015). Moreover, ANITA enjoys a simpler loopless
algorithmic structure, unlike previous accelerated algorithms such as Katyusha
(Allen-Zhu, 2017) and Varag (Lan et al., 2019), which use an inconvenient
double-loop structure. Finally, experimental results show that ANITA converges
faster than the previous state-of-the-art Varag (Lan et al., 2019), validating
our theoretical results and confirming the practical superiority of ANITA.
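To make the "loopless" structure concrete, here is a minimal sketch, assuming a least-squares finite sum and a generic L-SVRG-style update: it is not the actual ANITA method, only an illustration of how a probabilistic full-gradient refresh inside a single loop replaces the outer loop used by double-loop methods. The function name `loopless_vr_gd`, the quadratic objective, and the step-size and refresh-probability choices are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def loopless_vr_gd(A, b, lr=0.05, p=None, iters=2000, seed=0):
    """Sketch of a loopless variance-reduced gradient method (L-SVRG-style).

    Minimizes f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
    The full-gradient snapshot is refreshed by a coin flip inside the single
    loop, instead of in a separate outer loop over epochs.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    if p is None:
        p = 1.0 / n                            # a typical refresh probability
    x = np.zeros(d)
    w_ref = x.copy()                           # snapshot (reference) point
    full_grad = A.T @ (A @ w_ref - b) / n      # full gradient at the snapshot
    for _ in range(iters):
        i = rng.integers(n)
        # variance-reduced stochastic gradient estimator
        g = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ w_ref - b[i]) + full_grad
        x = x - lr * g
        if rng.random() < p:                   # loopless: refresh snapshot by coin flip
            w_ref = x.copy()
            full_grad = A.T @ (A @ w_ref - b) / n
    return x
```

For example, `loopless_vr_gd(np.random.randn(200, 10), np.random.randn(200))` runs a single flat loop; a double-loop method would instead wrap the inner updates in an explicit outer loop that recomputes the full gradient every epoch.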
Related papers
- A note on quantum lower bounds for local search via congestion and expansion [4.68073705539907]
We show that the quantum query complexity of local search on $G$ is $\Omega\bigl(\frac{n^{3/4}}{\sqrt{g}}\bigr)$.
In contrast to the classical setting, a gap remains in the quantum case between our lower bound and the best-known upper bound of $O\bigl(n^{1/3}\bigr)$ for such graphs.
arXiv Detail & Related papers (2024-12-17T21:42:42Z) - Nearly Optimal Regret for Decentralized Online Convex Optimization [53.433398074919]
Decentralized online convex optimization (D-OCO) aims to minimize a sequence of global loss functions using only local computations and communications.
We develop novel D-OCO algorithms that can respectively reduce the regret bounds for convex and strongly convex functions.
Our algorithms are nearly optimal in terms of $T$, $n$, and $\rho$.
arXiv Detail & Related papers (2024-02-14T13:44:16Z) - Faster Stochastic Algorithms for Minimax Optimization under
Polyak--{\L}ojasiewicz Conditions [12.459354707528819]
We propose SPIDER-GDA for solving the finite-sum problem of the form $\min_x \max_y f(x,y)\triangleq \frac{1}{n}\sum_{i=1}^n f_i(x,y)$.
We prove SPIDER-GDA could find an $\epsilon$-optimal solution within $\mathcal{O}\left((n + \sqrt{n}\,\kappa_x\kappa_y^2)\log(1/\epsilon)\right)$ stochastic first-order oracle calls.
arXiv Detail & Related papers (2023-07-29T02:26:31Z) - Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function
Approximation: Minimax Optimal and Instance-Dependent Regret Bounds [26.277745106128197]
In this work, we address the challenge of such rewards in reinforcement learning with linear function approximation.
We first design an algorithm, \textsc{Heavy-OFUL}, for heavy-tailed linear bandits, achieving an \emph{instance-dependent} $T$-round regret bound.
Our result is achieved via a novel self-normalized concentration inequality that may be of independent interest in handling heavy-tailed noise in general online regression problems.
arXiv Detail & Related papers (2023-06-12T02:56:09Z) - Faster Rates of Convergence to Stationary Points in Differentially
Private Optimization [31.46358820048179]
We study the problem of approximating stationary points of Lipschitz and smooth functions under $(\varepsilon,\delta)$-differential privacy (DP).
A point $\widehat{w}$ is called an $\alpha$-stationary point of a function $F:\mathbb{R}^d\rightarrow\mathbb{R}$ if $\|\nabla F(\widehat{w})\|\leq \alpha$.
We provide a new efficient algorithm that finds an $\alpha$-stationary point at a faster rate than previously known.
arXiv Detail & Related papers (2022-06-02T02:43:44Z) - Private Stochastic Convex Optimization: Optimal Rates in $\ell_1$
Geometry [69.24618367447101]
Up to logarithmic factors, the optimal excess population loss of any $(\varepsilon,\delta)$-differentially private optimizer is $\sqrt{\log(d)/n} + \sqrt{d}/(\varepsilon n)$.
We show that when the loss functions satisfy additional smoothness assumptions, the excess loss is upper bounded (up to logarithmic factors) by $\sqrt{\log(d)/n} + (\log(d)/\varepsilon n)^{2/3}$.
arXiv Detail & Related papers (2021-03-02T06:53:44Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We show that HSDMPG can attain an optimization error of $\mathcal{O}\big(1/\sqrt{n}\big)$, which is at the order of the intrinsic excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z) - An Optimal Separation of Randomized and Quantum Query Complexity [67.19751155411075]
We prove that for every decision tree, the absolute values of the Fourier coefficients of a given order $\ell\geq 1$ sum to at most $c^{\ell}\sqrt{\binom{d}{\ell}(1+\log n)^{\ell-1}}$, where $n$ is the number of variables, $d$ is the tree depth, and $c>0$ is an absolute constant.
arXiv Detail & Related papers (2020-08-24T06:50:57Z) - Acceleration for Compressed Gradient Descent in Distributed and
Federated Optimization [31.066505042748826]
We propose the first accelerated compressed gradient descent (ACGD) methods.
We show that ACGD enjoys the rate $O\Big((1+\omega)\sqrt{\frac{L}{\mu}}\log\frac{1}{\epsilon}\Big)$ for $\mu$-strongly convex problems.
We also propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(\omega+\sqrt{\frac{L}{\mu}}+\sqrt{\big(\frac{\omega}{n}+\sqrt{\frac{\omega}{n}}\big)\frac{\omega L}{\mu}}\Big)$.
arXiv Detail & Related papers (2020-02-26T09:03:23Z) - Revisiting EXTRA for Smooth Distributed Optimization [70.65867695317633]
We give a sharp complexity analysis for EXTRA with the improved complexity $O\left(\left(\frac{L}{\mu}+\frac{1}{1-\sigma_2(W)}\right)\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$.
Our communication complexities of the accelerated EXTRA are only worse than the corresponding lower bounds by factors of $\left(\log\frac{L}{\mu(1-\sigma_2(W))}\right)$ and $\left(\log\frac{1}{\epsilon(1-\sigma_2(W))}\right)$, respectively.
arXiv Detail & Related papers (2020-02-24T08:07:08Z) - On the Complexity of Minimizing Convex Finite Sums Without Using the
Indices of the Individual Functions [62.01594253618911]
We exploit the finite noise structure of finite sums to derive a matching $O(n^2)$ upper bound under the global oracle model.
Following a similar approach, we propose a novel adaptation of SVRG which is both \emph{compatible} with oracles and achieves complexity bounds of $\tilde{O}(n^2+n\sqrt{L/\mu})\log(1/\epsilon)$ and $O(n\sqrt{L/\epsilon})$, for $\mu>0$ and $\mu=0$, respectively.
arXiv Detail & Related papers (2020-02-09T03:39:46Z)