Related papers: Parallel Momentum Methods Under Biased Gradient Estimations

Parallel Momentum Methods Under Biased Gradient Estimations

URL: http://arxiv.org/abs/2403.00853v2
Date: Wed, 16 Oct 2024 09:39:02 GMT
Title: Parallel Momentum Methods Under Biased Gradient Estimations
Authors: Ali Beikmohammadi, Sarit Khirirat, Sindri Magnússon,
Abstract summary: Parallel gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased bounds, which have been the focus of most theoretical research, is challenging in many machine learning applications. In this paper we work out the implications for special gradient where estimates are biased, i.e. in meta-learning and when gradients are compressed or clipped.
Score: 11.074080383657453
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Parallel stochastic gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes. However, obtaining unbiased stochastic gradients, which have been the focus of most theoretical research, is challenging in many distributed machine learning applications. The gradient estimations easily become biased, for example, when gradients are compressed or clipped, when data is shuffled, and in meta-learning and reinforcement learning. In this work, we establish worst-case bounds on parallel momentum methods under biased gradient estimation on both general non-convex and $\mu$-PL problems. Our analysis covers general distributed optimization problems, and we work out the implications for special cases where gradient estimates are biased, i.e. in meta-learning and when the gradients are compressed or clipped. Our numerical experiments verify our theoretical findings and show faster convergence performance of momentum methods than traditional biased gradient descent.

Related papers

Dynamic Decoupling of Placid Terminal Attractor-based Gradient Descent Algorithm [56.06235614890066]
Gradient descent (GD) and gradient descent (SGD) have been widely used in a number of application domains. This paper carefully analyzes the dynamics of GD based on the terminal attractor at different stages of its gradient flow.
arXiv Detail & Related papers (2024-09-10T14:15:56Z)
Almost sure convergence rates of stochastic gradient methods under gradient domination [2.96614015844317]
Global and local gradient domination properties have shown to be a more realistic replacement of strong convexity. We prove almost sure convergence rates $f(X_n)-f*in obig( n-frac14beta-1+epsilonbig)$ of the last iterate for gradient descent. We show how to apply our results to the training task in both supervised and reinforcement learning.
arXiv Detail & Related papers (2024-05-22T12:40:57Z)
Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased gradients converge to critical points for smooth non- functions. We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z)
Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when emphdone right -- by which we mean using specific insights from optimisation and kernel communities -- gradient descent is highly effective. We introduce a emphstochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices. Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z)
On the Benefits of Large Learning Rates for Kernel Methods [110.03020563291788]
We show that a phenomenon can be precisely characterized in the context of kernel methods. We consider the minimization of a quadratic objective in a separable Hilbert space, and show that with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution.
arXiv Detail & Related papers (2022-02-28T13:01:04Z)
Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization [89.66571637204012]
textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing. textscAdaMomentum on vision, and achieves state-the-art results consistently on other tasks including language processing.
arXiv Detail & Related papers (2021-06-22T03:13:23Z)
Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow. We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
arXiv Detail & Related papers (2021-06-17T14:16:04Z)
Stability and Generalization of Stochastic Gradient Methods for Minimax Problems [71.60601421935844]
Many machine learning problems can be formulated as minimax problems such as Generative Adversarial Networks (GANs) We provide a comprehensive generalization analysis of examples from training gradient methods for minimax problems.
arXiv Detail & Related papers (2021-05-08T22:38:00Z)
Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-perfect solutions to non-optimal training problems. We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z)
Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit Bias towards Low Rank [1.9350867959464846]
In deep learning, gradientdescent tends to prefer solutions which generalize well. In this paper we analyze the dynamics of gradient descent in the simplifiedsetting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z)
On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the guiding domain of biased gradient methods (SGD), where individual updates are corrupted by compression. We quantify how many magnitudes of bias accuracy and convergence rates are impacted.
arXiv Detail & Related papers (2020-07-31T19:37:59Z)
A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling. We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
arXiv Detail & Related papers (2020-07-09T03:23:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.