SVRG Meets AdaGrad: Painless Variance Reduction
- URL: http://arxiv.org/abs/2102.09645v1
- Date: Thu, 18 Feb 2021 22:26:19 GMT
- Title: SVRG Meets AdaGrad: Painless Variance Reduction
- Authors: Benjamin Dubois-Taine, Sharan Vaswani, Reza Babanezhad, Mark Schmidt,
Simon Lacoste-Julien
- Abstract summary: We propose a fully adaptive variant of SVRG, a common VR method.
AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to the choice of step-size.
We validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over other "tune-free" VR methods.
- Score: 34.42463428418348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variance reduction (VR) methods for finite-sum minimization typically require
the knowledge of problem-dependent constants that are often unknown and
difficult to estimate. To address this, we use ideas from adaptive gradient
methods to propose AdaSVRG, which is a fully adaptive variant of SVRG, a common
VR method. AdaSVRG uses AdaGrad in the inner loop of SVRG, making it robust to
the choice of step-size, and allowing it to adaptively determine the length of
each inner-loop. When minimizing a sum of $n$ smooth convex functions, we prove
that AdaSVRG requires $O(n + 1/\epsilon)$ gradient evaluations to achieve an
$\epsilon$-suboptimality, matching the typical rate, but without needing to
know problem-dependent constants. However, VR methods including AdaSVRG are
slower than SGD when used with over-parameterized models capable of
interpolating the training data. Hence, we also propose a hybrid algorithm that
can adaptively switch from AdaGrad to AdaSVRG, achieving the best of both
stochastic gradient and VR methods, but without needing to tune their
step-sizes. Via experiments on synthetic and standard real-world datasets, we
validate the robustness and effectiveness of AdaSVRG, demonstrating its
superior performance over other "tune-free" VR methods.
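For intuition, the sketch below shows how an AdaGrad-style step-size can be dropped into the inner loop of SVRG, in the spirit of AdaSVRG. It is a minimal sketch only: the `grad_fi` oracle, the scalar AdaGrad accumulator, and the fixed inner-loop length are illustrative assumptions rather than the paper's exact adaptive inner-loop termination rule or its hybrid AdaGrad-to-AdaSVRG switch.

```python
import numpy as np

def adasvrg_sketch(grad_fi, x0, n, outer_iters=20, inner_iters=None, eta=1.0, seed=0):
    """Illustrative AdaSVRG-style loop: SVRG variance reduction combined with
    an AdaGrad (scalar-norm) step-size in the inner loop.

    grad_fi(x, i): gradient of the i-th component function f_i at x.
    n: number of component functions in the finite sum.
    NOTE: the fixed inner-loop length and the scalar AdaGrad rule are
    simplifications of the paper's adaptive scheme.
    """
    rng = np.random.default_rng(seed)
    x_snap = np.asarray(x0, dtype=float).copy()
    m = inner_iters if inner_iters is not None else n  # roughly one pass per inner loop
    for _ in range(outer_iters):
        # Full gradient at the snapshot point (the SVRG anchor).
        mu = np.mean([grad_fi(x_snap, i) for i in range(n)], axis=0)
        x = x_snap.copy()
        accum = 0.0                  # AdaGrad accumulator of squared gradient norms
        x_avg = np.zeros_like(x)
        for _ in range(m):
            i = rng.integers(n)
            # SVRG variance-reduced gradient estimate (unbiased for the full gradient).
            g = grad_fi(x, i) - grad_fi(x_snap, i) + mu
            accum += float(g @ g)
            x = x - (eta / (np.sqrt(accum) + 1e-12)) * g
            x_avg += x / m
        x_snap = x_avg               # restart the next outer loop from the averaged iterate
    return x_snap
```

As a usage illustration, for a least-squares objective with a hypothetical data matrix `A` and targets `b`, one could pass `grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]`. The hybrid algorithm mentioned in the abstract would instead begin with plain AdaGrad updates and switch to a loop of this form according to the paper's adaptive criterion, which is not sketched here.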
Related papers
- A Coefficient Makes SVRG Effective [55.104068027239656]
Stochastic Variance Reduced Gradient (SVRG) is a theoretically compelling optimization method.
In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks.
Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses (see the sketch after this list).
arXiv Detail & Related papers (2023-11-09T18:47:44Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential-rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the step-size $\alpha$ based on situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Enhanced Adaptive Gradient Algorithms for Nonconvex-PL Minimax Optimization [41.28002701420715]
Minimax optimization has been widely applied in many machine learning tasks.
We show that our methods have the best known sample complexity without relying on any specific type of adaptive learning rate.
arXiv Detail & Related papers (2023-03-07T15:33:12Z) - Greedy based Value Representation for Optimal Coordination in
Multi-agent Reinforcement Learning [64.05646120624287]
We derive the expression of the joint Q value function of LVD and MVD.
To ensure optimal consistency, the optimal node is required to be the unique STN.
Our method outperforms state-of-the-art baselines in experiments on various benchmarks.
arXiv Detail & Related papers (2022-11-22T08:14:50Z) - Adaptive Federated Minimax Optimization with Lower Complexities [82.51223883622552]
We propose an efficient adaptive minimax optimization algorithm (i.e., AdaFGDA) to solve these minimax problems.
It builds on momentum-based variance-reduced and local-SGD techniques, and it flexibly incorporates various adaptive learning rates.
arXiv Detail & Related papers (2022-11-14T12:32:18Z) - Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex
Optimization [23.447011255046835]
We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer.
First, we propose a simple proximal stochastic gradient algorithm based on variance reduction, called ProxSVRG+.
We show that our algorithm is almost as simple as its counterpart, and achieves similar optimal rates.
arXiv Detail & Related papers (2022-08-22T02:40:35Z) - Adaptive Accelerated (Extra-)Gradient Methods with Variance Reduction [25.94147708122371]
We propose two novel adaptive VR algorithms: Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG).
Our algorithms do not require knowledge of the smoothness parameter.
We demonstrate the superior performance of our algorithms compared with previous methods in experiments on real-world datasets.
arXiv Detail & Related papers (2022-01-28T18:07:25Z) - Randomized Stochastic Gradient Descent Ascent [37.887266927498395]
An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function that is itself defined as a maximum.
We propose RSGDA (Randomized SGDA), a variant of ESGDA with random loop sizes and a simpler theoretical analysis.
arXiv Detail & Related papers (2021-11-25T16:44:19Z) - AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax
Optimization [104.96004056928474]
We propose a class of faster adaptive gradient descent ascent methods for nonconvex-strongly-concave minimax problems.
We show that our method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with a mini-batch size of $O(\kappa)$.
arXiv Detail & Related papers (2021-06-30T14:47:09Z) - Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for
Strongly Convex Decentralized Optimization [69.49313819343992]
We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing.
The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds.
The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa\ldots)$
arXiv Detail & Related papers (2020-09-09T15:48:44Z) - Variance reduction for Riemannian non-convex optimization with batch
size adaptation [36.79189106909088]
Variance reduction techniques are popular for accelerating batch gradient descent.
We show that this strategy can achieve lower total complexities for Riemannian non-convex optimization under both finite-sum and online settings.
Specifically, we prove that R-SRG attains nearly the same complexity bound as R-SPIDER without requiring a small step size.
arXiv Detail & Related papers (2020-07-03T04:34:39Z)
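The coefficient-scaled variance reduction described in "A Coefficient Makes SVRG Effective" above can be pictured with the short sketch below. The decaying schedule and function names are assumptions for illustration; the entry only states that the coefficient should be smaller for deeper networks and decrease over training.

```python
def coeff_svrg_grad(grad_fi, x, x_snap, full_grad_snap, i, alpha):
    """SVRG gradient estimate whose variance-reduction term is scaled by a
    coefficient alpha in [0, 1]: alpha = 1 recovers standard SVRG and
    alpha = 0 recovers plain SGD; the estimate stays unbiased for any alpha."""
    return grad_fi(x, i) - alpha * (grad_fi(x_snap, i) - full_grad_snap)

def decaying_alpha(step, total_steps, alpha0=0.5):
    """Hypothetical linearly decaying coefficient schedule; the cited paper's
    finding is only qualitative (the coefficient should shrink as training
    progresses), so this exact schedule is an assumption."""
    return alpha0 * max(0.0, 1.0 - step / total_steps)
```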