Convergence Analysis of alpha-SVRG under Strong Convexity
- URL: http://arxiv.org/abs/2503.12454v1
- Date: Sun, 16 Mar 2025 11:17:35 GMT
- Title: Convergence Analysis of alpha-SVRG under Strong Convexity
- Authors: Sean Xiao, Sangwoo Park, Stefan Vlaski
- Abstract summary: The variance-reduction technique alpha-SVRG allows for fine-grained control of the residual noise in the learning dynamics. We show that alpha-SVRG has a faster convergence rate than SGD and SVRG under a suitable choice of alpha.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic first-order methods for empirical risk minimization employ gradient approximations based on sampled data in lieu of exact gradients. Such constructions introduce noise into the learning dynamics, which can be corrected through variance-reduction techniques. There is increasing evidence in the literature that in many modern learning applications noise can have a beneficial effect on optimization and generalization. To this end, the recently proposed variance-reduction technique alpha-SVRG [Yin et al., 2023] allows for fine-grained control of the level of residual noise in the learning dynamics, and has been reported to empirically outperform both SGD and SVRG in modern deep learning scenarios. Focusing on strongly convex environments, we first provide a unified convergence rate expression for alpha-SVRG under a fixed learning rate, which reduces to that of either SGD or SVRG by setting alpha=0 or alpha=1, respectively. We show that alpha-SVRG has a faster convergence rate than SGD and SVRG under a suitable choice of alpha. Simulation results on linear regression validate our theory.
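For context, alpha-SVRG scales the SVRG control variate by a coefficient alpha, so that alpha=0 recovers SGD and alpha=1 recovers SVRG. Below is a minimal, illustrative sketch on a least-squares problem, in the spirit of the paper's linear regression simulations; the function names, step size, and epoch budget are assumptions rather than the authors' code.

```python
import numpy as np

def alpha_svrg(X, y, alpha=0.5, lr=0.1, n_epochs=20, seed=0):
    """Illustrative alpha-SVRG loop for least-squares regression.

    Update: g = grad_i(w) - alpha * (grad_i(w_snap) - full_grad(w_snap)).
    alpha=0 gives plain SGD; alpha=1 gives standard SVRG.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)

    def grad(v, idx):
        # Mean least-squares gradient over the rows indexed by idx.
        Xi, yi = X[idx], y[idx]
        return Xi.T @ (Xi @ v - yi) / len(idx)

    for _ in range(n_epochs):
        w_snap = w.copy()                        # snapshot iterate
        full_grad = grad(w_snap, np.arange(n))   # exact gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            # Control variate scaled by alpha sets the residual noise level.
            g = grad(w, [i]) - alpha * (grad(w_snap, [i]) - full_grad)
            w -= lr * g
    return w
```

Intermediate values of alpha retain a controlled amount of residual gradient noise, which is exactly the knob the convergence analysis above quantifies.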
Related papers
- Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
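As a rough, generic illustration of these two mechanisms (not the paper's exact algorithm or its accelerated variant), a normalized and clipped SGD step might look like:

```python
import numpy as np

def normalized_clipped_step(w, grad, lr=0.01, clip=1.0, eps=1e-8):
    """One SGD step combining gradient normalization with clipping.

    The update moves along the unit gradient direction with a step
    magnitude capped at `clip`, so heavy-tailed gradient spikes cannot
    produce arbitrarily large parameter updates.
    """
    norm = np.linalg.norm(grad)
    direction = grad / (norm + eps)              # normalized descent direction
    return w - lr * min(norm, clip) * direction  # magnitude capped at clip
```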
- A Coefficient Makes SVRG Effective [51.36251650664215]
Stochastic Variance Reduced Gradient (SVRG) is a theoretically compelling optimization method. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks.
arXiv Detail & Related papers (2023-11-09T18:47:44Z)
- Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z)
- Closing the gap between SVRG and TD-SVRG with Gradient Splitting [17.071971639540976]
Temporal difference (TD) learning is a policy evaluation method in reinforcement learning whose performance can be enhanced by variance-reduction techniques.
We utilize a recent interpretation of TD-learning as the splitting of the gradient of an appropriately chosen function, thus simplifying the algorithm and fusing TD with SVRG.
Our main result is a geometric convergence bound with a predetermined learning rate of $1/8$, identical to the convergence bound available for SVRG in the convex setting.
arXiv Detail & Related papers (2022-11-29T14:21:34Z)
- NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer [45.47667026025716]
We propose a novel, robust and accelerated iteration that relies on two key elements.
The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively.
We show that NAG-GS is competitive with state-of-the-art methods such as momentum SGD with weight decay and AdamW for the training of machine learning models.
arXiv Detail & Related papers (2022-09-29T16:54:53Z)
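For background only: NAG-GS builds on Nesterov's accelerated gradient flow (with, per the title, a semi-implicit discretization). The sketch below is plain NAG, the underlying accelerated update, not the NAG-GS method itself:

```python
import numpy as np

def nag(grad, w0, lr=0.1, momentum=0.9, steps=100):
    """Plain Nesterov accelerated gradient (NAG), shown for context;
    NAG-GS modifies how this dynamics is discretized."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + momentum * v             # evaluate gradient ahead
        v = momentum * v - lr * grad(lookahead)  # momentum-corrected step
        w = w + v
    return w

# Example: minimize f(w) = 0.5 * ||w||^2 (gradient is w), optimum at 0.
w_star = nag(lambda w: w, w0=np.ones(3))
```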
- Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality [65.67315418971688]
Nearest Orthogonal Gradient (NOG) and Optimal Learning Rate (OLR) are proposed.
Experiments on visual recognition demonstrate that our methods can simultaneously improve the covariance conditioning and generalization.
arXiv Detail & Related papers (2022-07-05T15:39:29Z)
- Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise [64.85879194013407]
We prove the first high-probability results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone variational inequality problems (VIPs).
Our results match the best-known ones in the light-tails case and are novel for structured non-monotone problems.
In addition, we numerically validate that the gradient noise of many practical formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
arXiv Detail & Related papers (2022-06-02T15:21:55Z)
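As an illustrative sketch of the clipping idea for min-max problems (assumed setup, not the paper's SEG/SGDA implementation), clipped stochastic gradient descent-ascent can be written as:

```python
import numpy as np

def clipped_sgda(x, y, grad_x, grad_y, lr=0.05, clip=1.0,
                 steps=2000, noise=1.0, seed=0):
    """Clipped stochastic gradient descent-ascent (illustrative).

    Each noisy gradient is rescaled to norm at most `clip` before the
    update, which tames heavy-tailed gradient noise.
    """
    rng = np.random.default_rng(seed)

    def clipped(g):
        n = np.linalg.norm(g)
        return g * min(1.0, clip / n) if n > 0 else g

    for _ in range(steps):
        gx = grad_x(x, y) + noise * rng.standard_normal(x.shape)
        gy = grad_y(x, y) + noise * rng.standard_normal(y.shape)
        x = x - lr * clipped(gx)   # descent on the min player
        y = y + lr * clipped(gy)   # ascent on the max player
    return x, y

# Strongly-convex-strongly-concave toy saddle with equilibrium at 0:
# f(x, y) = 0.5*||x||^2 + x @ y - 0.5*||y||^2.
x_T, y_T = clipped_sgda(np.ones(2), np.ones(2),
                        grad_x=lambda x, y: x + y,
                        grad_y=lambda x, y: x - y)
```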
- Stochastic Second-Order Methods Provably Beat SGD For Gradient-Dominated Functions [42.57892322514602]
We show that SCRN improves the best-known sample complexity of stochastic gradient descent by a factor of $\mathcal{O}(\epsilon^{-1/2})$.
We also show that the sample complexity of SCRN can be improved by a factor of $\mathcal{O}(\epsilon^{-1/2})$ using a variance reduction method with time-varying batch sizes.
arXiv Detail & Related papers (2022-05-25T15:33:00Z)
- Optimizing Information-theoretical Generalization Bounds via Anisotropic Noise in SGLD [73.55632827932101]
We optimize the information-theoretical generalization bound by manipulating the noise structure in SGLD.
We prove that, under the constraint of guaranteeing low empirical risk, the optimal noise covariance is the square root of the expected gradient covariance.
arXiv Detail & Related papers (2021-10-26T15:02:27Z)
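A toy sketch of that prescription (assumed estimation procedure and names, not the paper's code): inject Gaussian noise whose covariance is the square root of an empirical gradient covariance.

```python
import numpy as np

def sgld_step_anisotropic(w, sample_grads, lr=0.01, temperature=1e-3, seed=None):
    """One SGLD-style step with anisotropic injected noise (toy sketch).

    Per the result above, the noise covariance is the square root of a
    gradient covariance, estimated here from per-sample gradients.
    """
    rng = np.random.default_rng(seed)
    mean_grad = sample_grads.mean(axis=0)

    # Empirical gradient covariance and its (PSD) matrix square root.
    centered = sample_grads - mean_grad
    cov = centered.T @ centered / len(sample_grads)
    eigval, eigvec = np.linalg.eigh(cov)
    sqrt_cov = eigvec @ np.diag(np.sqrt(np.clip(eigval, 0.0, None))) @ eigvec.T

    # Gradient step plus anisotropic Gaussian noise N(0, temperature * sqrt_cov).
    noise = rng.multivariate_normal(np.zeros_like(w), temperature * sqrt_cov)
    return w - lr * mean_grad + noise
```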
- Towards Understanding Label Smoothing [36.54164997035046]
Label smoothing regularization (LSR) has seen great success in training deep neural networks.
We show that an appropriate LSR can help to speed up convergence by reducing the variance.
We propose a simple yet effective strategy, namely the Two-Stage LAbel smoothing algorithm (TSLA).
arXiv Detail & Related papers (2020-06-20T20:36:17Z)
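For reference, this is the basic LSR transform that a two-stage schedule like TSLA would switch on and off (the staging logic itself is the paper's contribution and is omitted here):

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Standard label smoothing: mix the one-hot target with uniform mass."""
    num_classes = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / num_classes

# Example: a 3-class one-hot label with eps = 0.1 becomes
# [0.933..., 0.033..., 0.033...].
print(smooth_labels(np.array([1.0, 0.0, 0.0])))
```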
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)