SVRG Meets AdaGrad: Painless Variance Reduction
- URL: http://arxiv.org/abs/2102.09645v1
- Date: Thu, 18 Feb 2021 22:26:19 GMT
- Title: SVRG Meets AdaGrad: Painless Variance Reduction
- Authors: Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt,
Simon Lacoste-Julien
- Abstract summary: We propose a fully adaptive variant of SVRG, a common VR method.
AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to the choice of step-size.
We validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over other "tune-free" VR methods.
- Score: 34.42463428418348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variance reduction (VR) methods for finite-sum minimization typically require
the knowledge of problem-dependent constants that are often unknown and
difficult to estimate. To address this, we use ideas from adaptive gradient
methods to propose AdaSVRG, which is a fully adaptive variant of SVRG, a common
VR method. AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to
the choice of step-size, and allowing it to adaptively determine the length of
each inner-loop. When minimizing a sum of $n$ smooth convex functions, we prove
that AdaSVRG requires $O(n + 1/\epsilon)$ gradient evaluations to achieve an
$\epsilon$-suboptimality, matching the typical rate, but without needing to
know problem-dependent constants. However, VR methods including AdaSVRG are
slower than SGD when used with over-parameterized models capable of
interpolating the training data. Hence, we also propose a hybrid algorithm that
can adaptively switch from AdaGrad to AdaSVRG, achieving the best of both
stochastic gradient and VR methods, but without needing to tune their
step-sizes. Via experiments on synthetic and standard real-world datasets, we
validate the robustness and effectiveness of AdaSVRG, demonstrating its
superior performance over other "tune-free" VR methods.
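For intuition, the sketch below shows how an AdaGrad-style step-size can be dropped into the inner loop of SVRG, in the spirit of AdaSVRG. It is a minimal sketch only: the `grad_fi` oracle, the scalar AdaGrad accumulator, and the fixed inner-loop length are illustrative assumptions rather than the paper's exact adaptive inner-loop termination rule or its hybrid AdaGrad-to-AdaSVRG switch.

```python
import numpy as np

def adasvrg_sketch(grad_fi, x0, n, outer_iters=20, inner_iters=None, eta=1.0, seed=0):
    """Illustrative AdaSVRG-style loop: SVRG variance reduction combined with
    an AdaGrad (scalar-norm) step-size in the inner loop.

    grad_fi(x, i): gradient of the i-th component function f_i at x.
    n: number of component functions in the finite sum.
    NOTE: the fixed inner-loop length and the scalar AdaGrad rule are
    simplifications of the paper's adaptive scheme.
    """
    rng = np.random.default_rng(seed)
    x_snap = np.asarray(x0, dtype=float).copy()
    m = inner_iters if inner_iters is not None else n  # roughly one pass per inner loop
    for _ in range(outer_iters):
        # Full gradient at the snapshot point (the SVRG anchor).
        mu = np.mean([grad_fi(x_snap, i) for i in range(n)], axis=0)
        x = x_snap.copy()
        accum = 0.0                  # AdaGrad accumulator of squared gradient norms
        x_avg = np.zeros_like(x)
        for _ in range(m):
            i = rng.integers(n)
            # SVRG variance-reduced gradient estimate (unbiased for the full gradient).
            g = grad_fi(x, i) - grad_fi(x_snap, i) + mu
            accum += float(g @ g)
            x = x - (eta / (np.sqrt(accum) + 1e-12)) * g
            x_avg += x / m
        x_snap = x_avg               # restart the next outer loop from the averaged iterate
    return x_snap
```

As a usage illustration, for a least-squares objective with a hypothetical data matrix `A` and targets `b`, one could pass `grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]`. The hybrid algorithm mentioned in the abstract would instead begin with plain AdaGrad updates and switch to a loop of this form according to the paper's adaptive criterion, which is not sketched here.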
Related papers
- A Coefficient Makes SVRG Effective [55.104068027239656]
Stochastic Variance Reduced Gradient (SVRG) is a theoretically compelling optimization method.
In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks.
Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses (see the sketch after this list).
arXiv Detail & Related papers (2023-11-09T18:47:44Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential-rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the step-size $\alpha$ based on situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Enhanced Adaptive Gradient Algorithms for Nonconvex-PL Minimax Optimization [41.28002701420715]
Minimax optimization has been widely applied in many machine learning tasks.
We show that our methods have the best known sample complexity without relying on any specific type of adaptive learning rate.
arXiv Detail & Related papers (2023-03-07T15:33:12Z) - Greedy based Value Representation for Optimal Coordination in
Multi-agent Reinforcement Learning [64.05646120624287]
We derive the expression of the joint Q value function of LVD and MVD.
To ensure optimal consistency, the optimal node is required to be the unique STN.
Our method outperforms state-of-the-art baselines in experiments on various benchmarks.
arXiv Detail & Related papers (2022-11-22T08:14:50Z) - Adaptive Federated Minimax Optimization with Lower Complexities [82.51223883622552]
We propose an efficient adaptive minimax optimization algorithm (i.e., AdaFGDA) to solve these minimax problems.
It builds on momentum-based variance-reduced and local-SGD techniques, and it flexibly incorporates various adaptive learning rates.
arXiv Detail & Related papers (2022-11-14T12:32:18Z) - Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex
Optimization [23.447011255046835]
We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer.
First, we propose a simple proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+.
We show that our algorithm is almost as simple as its counterpart, and achieves similar optimal rates.
arXiv Detail & Related papers (2022-08-22T02:40:35Z) - Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction [25.94147708122371]
We propose two novel adaptive VR algorithms: Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG).
Our algorithms do not require knowledge of the smoothness parameter.
We demonstrate the superior performance of our algorithms compared with previous methods in experiments on real-world datasets.
arXiv Detail & Related papers (2022-01-28T18:07:25Z) - Randomized Stochastic Gradient Descent Ascent [37.887266927498395]
An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function that is itself defined as a maximum.
We propose RSGDA (Randomized SGDA), a variant of ESGDA with random loop sizes and a simpler theoretical analysis.
arXiv Detail & Related papers (2021-11-25T16:44:19Z) - AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax
Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with a mini-batch size of $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa\ldots)$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Variance reduction for Riemannian non-convex optimization with batch
size adaptation [36.79189106909088]
Variance reduction techniques are popular for accelerating batch gradient descent.
We show that this strategy can achieve lower total complexities for Riemannian non-convex optimization under both finite-sum and online settings.
Specifically, we prove that R-SRG attains nearly the same complexity bound as R-SPIDER without requiring a small step size.
arXiv Detail & Related papers (2020-07-03T04:34:39Z)
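The coefficient-scaled variance reduction described in "A Coefficient Makes SVRG Effective" above can be pictured with the short sketch below. The decaying schedule and function names are assumptions for illustration; the entry only states that the coefficient should be smaller for deeper networks and decrease over training.

```python
def coeff_svrg_grad(grad_fi, x, x_snap, full_grad_snap, i, alpha):
    """SVRG gradient estimate whose variance-reduction term is scaled by a
    coefficient alpha in [0, 1]: alpha = 1 recovers standard SVRG and
    alpha = 0 recovers plain SGD; the estimate stays unbiased for any alpha."""
    return grad_fi(x, i) - alpha * (grad_fi(x_snap, i) - full_grad_snap)

def decaying_alpha(step, total_steps, alpha0=0.5):
    """Hypothetical linearly decaying coefficient schedule; the cited paper's
    finding is only qualitative (the coefficient should shrink as training
    progresses), so this exact schedule is an assumption."""
    return alpha0 * max(0.0, 1.0 - step / total_steps)
```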