MSTGD: A Memory Stochastic sTratified Gradient Descent Method with an
Exponential Convergence Rate
- URL: http://arxiv.org/abs/2202.10923v1
- Date: Mon, 21 Feb 2022 01:36:26 GMT
- Title: MSTGD: A Memory Stochastic sTratified Gradient Descent Method with an
Exponential Convergence Rate
- Authors: Aixiang (Andy) Chen, Jinting Zhang, Zanbo Zhang, Zhihong Li
- Abstract summary: This paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descent (\underline{MST}GD) algorithm with an exponential convergence rate.
Specifically, MSTGD uses two strategies for variance reduction: the first strategy is to perform variance reduction according to the proportion p of reused historical gradients.
Unlike most other algorithms that claim an exponential convergence rate, MSTGD's rate is independent of parameters such as the dataset size N and the batch size n.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The fluctuation of gradient expectation and variance caused by parameter updates between consecutive iterations is neglected or confused by current mainstream gradient optimization algorithms. Exploiting this fluctuation effect, combined with a stratified sampling strategy, this paper designs a novel \underline{M}emory \underline{S}tochastic s\underline{T}ratified Gradient Descent (\underline{MST}GD) algorithm with an exponential convergence rate. Specifically, MSTGD uses two strategies for variance reduction: the first is to perform variance reduction according to the proportion p of reused historical gradients, which is estimated from the mean and variance of sample gradients before and after an iteration; the second is stratified sampling by category. The statistic $\bar{G}_{mst}$ designed under these two strategies can be kept adaptively unbiased, and its variance decays at a geometric rate. This enables MSTGD, based on $\bar{G}_{mst}$, to achieve an exponential convergence rate of the form $\lambda^{2(k-k_0)}$, where $\lambda\in(0,1)$ is determined by the proportion p and k is the number of iteration steps. Unlike most other algorithms that claim an exponential convergence rate, this rate is independent of parameters such as the dataset size N and the batch size n, and is achieved at a constant step size. Theoretical and experimental results demonstrate the effectiveness of MSTGD.
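To make the two strategies concrete, here is a minimal sketch, assuming a simple convex blend of the remembered gradient (proportion p) with a fresh class-stratified mini-batch gradient; the helper names (stratified_batch, mstgd_like_step) and the toy least-squares problem are illustrative, not the paper's estimator $\bar{G}_{mst}$ or its adaptive, data-driven choice of p.

```python
import numpy as np

# Hedged sketch: blend of remembered and fresh stratified gradients.
rng = np.random.default_rng(0)

def stratified_batch(labels, per_class):
    # draw `per_class` indices from every class (stratified sampling by category)
    return np.concatenate([
        rng.choice(np.flatnonzero(labels == c), per_class, replace=False)
        for c in np.unique(labels)
    ])

def mstgd_like_step(w, grad_fn, labels, memory, p=0.5, step=0.1, per_class=8):
    idx = stratified_batch(labels, per_class)
    g_new = grad_fn(w, idx)                        # stratified mini-batch gradient
    g = p * memory + (1 - p) * g_new if memory is not None else g_new
    return w - step * g, g                         # the blended g becomes the new memory

# toy usage: least squares with two "classes" of rows
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
labels = (rng.random(100) < 0.5).astype(int)
grad_fn = lambda w, idx: X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

w, memory = np.zeros(3), None
for _ in range(300):
    w, memory = mstgd_like_step(w, grad_fn, labels, memory)
print(w)  # approaches [1.0, -2.0, 0.5]
```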
Related papers
- Double Variance Reduction: A Smoothing Trick for Composite Optimization Problems without First-Order Gradient [40.22217106270146]
Variance reduction techniques are designed to decrease the sampling variance, thereby accelerating convergence rates of first-order (FO) and zeroth-order (ZO) optimization methods.
In composite optimization problems, ZO methods encounter an additional variance, the coordinate-wise variance, which stems from random gradient estimation.
This paper proposes the Zeroth-order Proximal Double Variance Reduction (ZPDVR) method, which utilizes the averaging trick to reduce both sampling and coordinate-wise variances.
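A minimal sketch of the averaging idea in a zeroth-order proximal method follows, assuming a two-point gradient estimate averaged over random directions (reducing coordinate-wise variance) inside a proximal step; the function names and the toy composite objective are illustrative, not ZPDVR's exact recursion.

```python
import numpy as np

# Hedged sketch: direction-averaged zeroth-order gradient plus a prox step.
def zo_gradient(f, x, n_dirs=20, mu=1e-4, rng=None):
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.standard_normal(x.shape)                     # random direction
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u  # two-point estimate
    return g / n_dirs                                        # averaging trick

def prox_l1(x, lam):
    # proximal operator of lam * ||x||_1 (soft-thresholding)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# toy composite problem: smooth quadratic + l1 regularizer
A = np.diag([1.0, 4.0])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
x, step = np.zeros(2), 0.1
for _ in range(200):
    x = prox_l1(x - step * zo_gradient(f, x), lam=0.01 * step)
print(x)
```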
arXiv Detail & Related papers (2024-05-28T02:27:53Z)
- Modified Step Size for Enhanced Stochastic Gradient Descent: Convergence and Experiments [0.0]
This paper introduces a novel approach to improving the performance of the stochastic gradient descent (SGD) algorithm by incorporating a modified decay step size based on $\frac{1}{\sqrt{t}}$.
The proposed step size integrates a logarithmic term, leading to the selection of smaller values in the final iterations.
To demonstrate the effectiveness of the approach, the authors conducted numerical experiments on image classification tasks using the FashionMNIST and CIFAR10 datasets.
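The listing truncates the exact schedule, so the sketch below assumes one plausible form of a $\frac{1}{\sqrt{t}}$ step size with a logarithmic factor yielding smaller late-iteration values; the specific formula is an assumption, not the paper's.

```python
import math

# Hedged sketch: an assumed 1/sqrt(t)-based schedule with a log factor,
# compared against the plain 1/sqrt(t) baseline.
def modified_step_size(t, eta0=0.1):
    return eta0 / (math.sqrt(t) * math.log(t + math.e))  # assumed variant

def baseline_step_size(t, eta0=0.1):
    return eta0 / math.sqrt(t)

for t in (1, 10, 100, 1000):
    print(t, baseline_step_size(t), modified_step_size(t))
```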
arXiv Detail & Related papers (2023-09-03T19:21:59Z)
- $\bar{G}_{mst}$: An Unbiased Stratified Statistic and a Fast Gradient Optimization Algorithm Based on It [0.0]
A novel algorithm named MSSG, designed based on $\bar{G}_{mst}$, outperforms other SGD-like algorithms.
arXiv Detail & Related papers (2021-10-07T11:48:55Z)
- Differentiable Annealed Importance Sampling and the Perils of Gradient Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
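A toy sketch of the idea follows: AIS between two unnormalized 1-D Gaussians where the Metropolis-Hastings correction is replaced by unadjusted Langevin transitions, so every operation is differentiable; the densities and step sizes are illustrative only.

```python
import numpy as np

# Hedged sketch: AIS with unadjusted Langevin (ULA) transitions, no MH step.
rng = np.random.default_rng(0)

log_base   = lambda x: -0.5 * x**2          # N(0,1), up to a constant
log_target = lambda x: -0.5 * (x - 2.0)**2  # N(2,1), up to a constant

def log_annealed(x, beta):
    return (1 - beta) * log_base(x) + beta * log_target(x)

def grad_log_annealed(x, beta):
    return (1 - beta) * (-x) + beta * (2.0 - x)

n_samples, n_steps, step = 1000, 100, 0.1
betas = np.linspace(0.0, 1.0, n_steps + 1)
x = rng.standard_normal(n_samples)          # draws from the base
log_w = np.zeros(n_samples)
for b_prev, b in zip(betas[:-1], betas[1:]):
    log_w += log_annealed(x, b) - log_annealed(x, b_prev)
    # ULA transition targeting the current annealed density (no MH correction)
    x += step * grad_log_annealed(x, b) + np.sqrt(2 * step) * rng.standard_normal(n_samples)

# log marginal-likelihood ratio estimate (log-mean-exp of the weights)
est = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
print("estimated log Z ratio:", est)  # true value is 0 for this toy pair
```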
arXiv Detail & Related papers (2021-07-21T17:10:14Z)
- Fast Margin Maximization via Dual Acceleration [52.62944011696364]
We present and analyze a momentum-based method for training linear classifiers with an exponentially-tailed loss.
This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual.
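As a rough illustration, the sketch below applies Nesterov-style extrapolation directly to the primal exponential loss of a linear classifier on toy separable data; the cited paper instead derives its momentum via the dual of the max-margin problem, so this is a simplified stand-in.

```python
import numpy as np

# Hedged sketch: accelerated descent on the exponential loss of a linear
# classifier, tracking the normalized margin. Data and names are illustrative.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])           # linearly separable labels

def grad_exp_loss(w):
    margins = y * (X @ w)
    return -(X.T @ (y * np.exp(-margins))) / len(y)

w, w_prev, step = np.zeros(2), np.zeros(2), 0.1
for t in range(1, 2000):
    v = w + (t - 1) / (t + 2) * (w - w_prev)   # Nesterov extrapolation
    w_prev, w = w, v - step * grad_exp_loss(v)

print("normalized margin:", np.min(y * (X @ w)) / np.linalg.norm(w))
```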
arXiv Detail & Related papers (2021-07-01T16:36:39Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, along with variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
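A minimal sketch of extragradient with iterate averaging on a toy bilinear game follows; stochastic gradients and the scheduled restarting procedure are omitted for brevity.

```python
import numpy as np

# Hedged sketch: extragradient with averaging on min_x max_y x^T A y,
# whose unique equilibrium is (0, 0).
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
x, y, step = np.ones(2), np.ones(2), 0.1
x_avg, y_avg = np.zeros(2), np.zeros(2)

for k in range(1, 1001):
    # extrapolation (lookahead) step
    x_half = x - step * (A @ y)
    y_half = y + step * (A.T @ x)
    # update step using gradients at the lookahead point
    x = x - step * (A @ y_half)
    y = y + step * (A.T @ x_half)
    # running iterate average, which drives convergence in the analysis
    x_avg += (x - x_avg) / k
    y_avg += (y - y_avg) / k

print("averaged iterates:", x_avg, y_avg)  # approach the equilibrium (0, 0)
```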
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - A Variance Controlled Stochastic Method with Biased Estimation for
Faster Non-convex Optimization [0.0]
We propose a new technique, variance controlled stochastic gradient (VCSG), to improve the performance of the stochastic variance reduced gradient (SVRG) method.
A parameter $\lambda$ is introduced in VCSG to avoid over-reducing the variance by SVRG.
VCSG requires $\mathcal{O}(\min\{1/\epsilon^{3/2}, n^{1/4}/\epsilon\})$ gradient evaluations, which improves the leading gradient complexity.
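The sketch below illustrates the role of $\lambda$ with an SVRG-style inner loop whose variance-reduction correction is scaled by a blending parameter, so $\lambda = 1$ recovers SVRG and $\lambda = 0$ recovers plain SGD; this conveys the idea only, not VCSG's exact (biased) estimator.

```python
import numpy as np

# Hedged sketch: SVRG with a lambda-scaled correction term on least squares.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]      # per-sample gradient
full_grad = lambda w: X.T @ (X @ w - y) / n

w, step, lambda_ = np.zeros(d), 0.01, 0.5
for epoch in range(30):
    w_snap, mu = w.copy(), full_grad(w)             # snapshot and its full gradient
    for _ in range(n):
        i = rng.integers(n)
        # lambda_ scales the variance-reduction correction
        g = grad_i(w, i) - lambda_ * (grad_i(w_snap, i) - mu)
        w -= step * g

print("error:", np.linalg.norm(w - w_true))
```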
arXiv Detail & Related papers (2021-02-19T12:22:56Z) - Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence
Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
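For context, a minimal sketch of the underlying two time-scale TDC recursion with linear function approximation follows, run on-policy on a toy chain for simplicity; the paper's variance-reduction scheme on top of these updates is omitted.

```python
import numpy as np

# Hedged sketch: base TDC updates (no variance reduction, no importance weights).
rng = np.random.default_rng(0)
phi = np.array([[1.0, 0.0], [0.0, 1.0]])  # one-hot features for 2 states
P = np.array([[0.9, 0.1], [0.2, 0.8]])    # transition matrix
r = np.array([1.0, 0.0])                  # reward per state
gamma, alpha, beta = 0.9, 0.05, 0.1

theta, w = np.zeros(2), np.zeros(2)
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] + gamma * phi[s_next] @ theta - phi[s] @ theta
    # fast time scale: auxiliary weights track the projected TD error
    w += beta * (delta - phi[s] @ w) * phi[s]
    # slow time scale: TD update plus the gradient-correction term
    theta += alpha * (delta * phi[s] - gamma * (phi[s] @ w) * phi[s_next])
    s = s_next

print("learned values:", theta)
```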
arXiv Detail & Related papers (2020-10-26T01:33:05Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
On two popular deep learning tasks, we demonstrate the empirical advantages of these algorithms over standard gradient methods.
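A minimal sketch of the clip-and-noise step at the core of DP stochastic optimization (as in DP-SGD) follows; the paper's adaptive algorithms and privacy accounting are beyond this illustration, and all names are illustrative.

```python
import numpy as np

# Hedged sketch: per-sample clipping plus Gaussian noise on the mean gradient.
rng = np.random.default_rng(0)

def dp_sgd_step(w, per_sample_grads, step=0.1, clip=1.0, noise_mult=1.0):
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / max(norm, 1e-12)))  # clip each sample
    g_bar = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound and batch size
    noise = rng.normal(0.0, noise_mult * clip / len(clipped), size=w.shape)
    return w - step * (g_bar + noise)

# toy usage: per-sample gradients of 0.5 * ||w - x_i||^2
xs = rng.standard_normal((32, 3))
w = np.zeros(3)
for _ in range(100):
    w = dp_sgd_step(w, [w - x for x in xs])
print(w)
```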
arXiv Detail & Related papers (2020-06-24T06:01:24Z) - On the Almost Sure Convergence of Stochastic Gradient Descent in
Non-Convex Problems [75.58134963501094]
This paper analyzes the trajectories of stochastic gradient descent (SGD).
We show that SGD avoids strict saddle points/manifolds with probability $1$ for a broad range of step-size policies.
arXiv Detail & Related papers (2020-06-19T14:11:26Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms: policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
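Since both results build on AMSGrad with a constant stepsize, a minimal sketch of that update follows, applied to a generic stochastic gradient rather than a PG or TD objective.

```python
import numpy as np

# Hedged sketch: the AMSGrad update rule with a constant stepsize.
rng = np.random.default_rng(0)

def amsgrad_step(w, g, state, step=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m, v, v_hat = state
    m = b1 * m + (1 - b1) * g           # first-moment estimate
    v = b2 * v + (1 - b2) * g * g       # second-moment estimate
    v_hat = np.maximum(v_hat, v)        # max keeps effective steps non-increasing
    w = w - step * m / (np.sqrt(v_hat) + eps)
    return w, (m, v, v_hat)

# toy usage: noisy gradients of 0.5 * ||w||^2
w = np.ones(3)
state = (np.zeros(3), np.zeros(3), np.zeros(3))
for _ in range(500):
    g = w + 0.1 * rng.standard_normal(3)  # stochastic gradient
    w, state = amsgrad_step(w, g, state)
print(w)
```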
arXiv Detail & Related papers (2020-02-15T00:26:49Z)