Random-reshuffled SARAH does not need a full gradient computations
- URL: http://arxiv.org/abs/2111.13322v2
- Date: Sun, 14 Jan 2024 12:30:15 GMT
- Title: Random-reshuffled SARAH does not need a full gradient computations
- Authors: Aleksandr Beznosikov and Martin Takáč
- Abstract summary: The StochAstic Recursive grAdient algoritHm (SARAH) is a variance-reduced variant of the Stochastic Gradient Descent (SGD) algorithm.
In this paper, we remove the necessity of a full gradient computation.
The aggregated stochastic gradients serve as an estimate of a full gradient in the SARAH algorithm.
- Score: 61.85897464405715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The StochAstic Recursive grAdient algoritHm (SARAH) algorithm is a variance
reduced variant of the Stochastic Gradient Descent (SGD) algorithm that needs a
gradient of the objective function from time to time. In this paper, we remove
the necessity of a full gradient computation. This is achieved by using a
randomized reshuffling strategy and aggregating stochastic gradients obtained
in each epoch. The aggregated stochastic gradients serve as an estimate of a
full gradient in the SARAH algorithm. We provide a theoretical analysis of the
proposed approach and conclude the paper with numerical experiments that
demonstrate the efficiency of this approach.
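Below is a minimal Python sketch of the idea described in the abstract: SARAH's recursive gradient estimator is run over a randomly reshuffled pass through the data, and the stochastic gradients computed during an epoch are averaged to serve as the next epoch's anchor in place of a full gradient. The function name, step size, and initialization of the first anchor are illustrative assumptions, not the paper's exact pseudocode.

```python
import numpy as np

def rr_sarah_sketch(grad_i, w0, n, n_epochs, lr=0.05, seed=0):
    """Sketch of SARAH with random reshuffling and no full-gradient anchor.

    grad_i(w, i): stochastic gradient of the i-th component function at w.
    The anchor for each new epoch is the average of the stochastic gradients
    aggregated during the previous epoch.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    # Illustrative initialization: a single stochastic gradient as the first
    # anchor (the paper's exact initialization may differ).
    v = grad_i(w, 0)
    for _ in range(n_epochs):
        perm = rng.permutation(n)   # random reshuffling: one pass without replacement
        agg = np.zeros_like(w)      # running sum of this epoch's stochastic gradients
        for i in perm:
            w_next = w - lr * v              # step with the recursive estimator
            g_new = grad_i(w_next, i)
            v = g_new - grad_i(w, i) + v     # SARAH recursion
            agg += g_new
            w = w_next
        v = agg / n                 # aggregated gradients replace the full-gradient anchor
    return w
```

For a least-squares instance with data rows a[i] and targets b[i], one could pass, for example, grad_i = lambda w, i: a[i] * (a[i] @ w - b[i]).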
Related papers
- An Accelerated Algorithm for Stochastic Bilevel Optimization under Unbounded Smoothness [15.656614304616006]
This paper investigates a class of bilevel optimization problems where the upper-level function is nonconvex with unbounded smoothness and the lower-level problem is strongly convex.
These problems have significant applications in data learning, such as text classification using neural networks.
arXiv Detail & Related papers (2024-09-28T02:30:44Z) - Forward Gradient-Based Frank-Wolfe Optimization for Memory Efficient Deep Neural Network Training [0.0]
This paper focuses on analyzing the performance of the well-known Frank-Wolfe algorithm.
We show the proposed algorithm does converge to the optimal solution with a sub-linear rate of convergence.
In contrast, the standard Frank-Wolfe algorithm, when provided with access to the Projected Forward Gradient, fails to converge to the optimal solution.
arXiv Detail & Related papers (2024-03-19T07:25:36Z) - One-step corrected projected stochastic gradient descent for statistical estimation [49.1574468325115]
It is based on the projected gradient descent on the log-likelihood function corrected by a single step of the Fisher scoring algorithm.
We show theoretically and by simulations that it is an interesting alternative to the usual gradient descent with averaging or the adaptive gradient descent.
arXiv Detail & Related papers (2023-06-09T13:43:07Z) - Non asymptotic analysis of Adaptive stochastic gradient algorithms and
applications [0.0]
This paper is devoted to the non-asymptotic analysis of these adaptive gradient algorithms for strongly convex objectives.
All the theoretical results are adapted to linear regression and the regularized generalized linear model, for both the Adagrad algorithm and Newton algorithms.
arXiv Detail & Related papers (2023-03-01T07:36:03Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work studies zeroth-order (ZO) optimization, which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient in terms of both complexity and function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Multi-kernel Passive Stochastic Gradient Algorithms and Transfer
Learning [21.796874356469644]
The passive stochastic gradient algorithm does not have control over the location where noisy gradients of the cost function are evaluated.
The proposed algorithm performs substantially better in high-dimensional problems and incorporates variance reduction.
arXiv Detail & Related papers (2020-08-23T11:55:19Z) - Variance Reduction for Deep Q-Learning using Stochastic Recursive
Gradient [51.880464915253924]
Deep Q-learning algorithms often suffer from poor gradient estimates with excessive variance.
This paper introduces a framework for updating the gradient estimates in deep Q-learning, yielding a novel algorithm called SRG-DQN.
arXiv Detail & Related papers (2020-07-25T00:54:20Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for non-concave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)