Distributed Momentum Methods Under Biased Gradient Estimations
- URL: http://arxiv.org/abs/2403.00853v1
- Date: Thu, 29 Feb 2024 18:03:03 GMT
- Title: Distributed Momentum Methods Under Biased Gradient Estimations
- Authors: Ali Beikmohammadi, Sarit Khirirat, Sindri Magnússon
- Abstract summary: Distributed gradient methods are gaining prominence in solving large-scale machine learning problems that involve data distributed across multiple nodes.
However, obtaining unbiased gradient estimates is challenging in many distributed machine learning applications.
In this paper, we establish non-asymptotic convergence bounds on distributed momentum methods under biased gradient estimation.
- Score: 6.046591474843391
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Distributed stochastic gradient methods are gaining prominence in solving
large-scale machine learning problems that involve data distributed across
multiple nodes. However, obtaining unbiased stochastic gradients, which have
been the focus of most theoretical research, is challenging in many distributed
machine learning applications. The gradient estimations easily become biased,
for example, when gradients are compressed or clipped, when data is shuffled,
and in meta-learning and reinforcement learning. In this work, we establish
non-asymptotic convergence bounds on distributed momentum methods under biased
gradient estimation on both general non-convex and $\mu$-PL non-convex
problems. Our analysis covers general distributed optimization problems, and we
work out the implications for special cases where gradient estimates are
biased, i.e., in meta-learning and when the gradients are compressed or
clipped. Our numerical experiments on training deep neural networks with
Top-$K$ sparsification and clipping verify faster convergence performance of
momentum methods than traditional biased gradient descent.
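To make the setting concrete, the sketch below runs heavy-ball momentum on a toy quadratic where the gradient estimate is deliberately biased by Top-$K$ sparsification and clipping, the two compressors named in the abstract. This is an illustrative sketch only; the function, hyperparameters, and helper names are assumptions, not the paper's algorithm or constants.

```python
import numpy as np

def top_k(g, k):
    """Biased compressor: keep the k largest-magnitude entries of g, zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def clip(g, c):
    """Biased whenever ||g|| > c: rescale the gradient to norm at most c."""
    n = np.linalg.norm(g)
    return g if n <= c else g * (c / n)

def momentum_step(x, m, grad_fn, lr=0.1, beta=0.9, k=2, c=5.0):
    """One heavy-ball step using a compressed (hence biased) gradient estimate."""
    g = clip(top_k(grad_fn(x), k), c)
    m = beta * m + g
    return x - lr * m, m

# Toy quadratic f(x) = 0.5 * ||x||^2 (mu-PL with mu = 1); true gradient is x.
x = np.array([1.0, -2.0, 0.5, 3.0])
m = np.zeros_like(x)
for _ in range(300):
    x, m = momentum_step(x, m, lambda z: z)
# The iterates shrink toward the optimum despite the biased gradient estimates.
```

On this toy problem the bias only delays, rather than prevents, convergence, which is the qualitative behavior the paper's bounds formalize.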
Related papers
- Almost sure convergence rates of stochastic gradient methods under gradient domination [2.96614015844317]
Global and local gradient domination properties have been shown to be a more realistic replacement of strong convexity.
We prove almost sure convergence rates $f(X_n) - f^* \in o\big(n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent.
We show how to apply our results to the training task in both supervised and reinforcement learning.
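For reference, the gradient domination condition invoked above can be stated as follows; the exact exponent convention is an assumption here, since the summary does not spell it out:

```latex
% beta-gradient domination: for some c > 0 and beta in (1, 2],
%   f(x) - f^* <= c * ||grad f(x)||^beta   for all x.
f(x) - f^* \le c \,\|\nabla f(x)\|^{\beta}, \qquad \beta \in (1, 2],
```

The case $\beta = 2$ recovers the Polyak-Łojasiewicz (PL) condition, which connects this line of work to the $\mu$-PL setting analyzed in the main paper above.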
arXiv Detail & Related papers (2024-05-22T12:40:57Z) - Flattened one-bit stochastic gradient descent: compressed distributed optimization with controlled variance [55.01966743652196]
We propose a novel algorithm for distributed stochastic gradient descent (SGD) with compressed gradient communication in the parameter-server framework.
Our gradient compression technique, named flattened one-bit stochastic gradient descent (FO-SGD), relies on two simple algorithmic ideas.
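A generic one-bit compressor conveys the flavor of such schemes: transmit only the sign of each coordinate plus one scalar scale. This is an illustration of scaled-sign compression, not the paper's exact FO-SGD, which additionally "flattens" the gradient with a random transform and uses dithering.

```python
import numpy as np

def one_bit_compress(g):
    """Scaled sign compression: send sign(g) (one bit per entry) plus a
    single scalar, mean(|g|). Illustrative only; not the paper's FO-SGD."""
    scale = np.mean(np.abs(g))
    return scale * np.sign(g)

# Parameter-server style aggregation of two workers' compressed gradients.
workers = [np.array([0.4, -1.2, 0.1]), np.array([0.6, -0.8, -0.3])]
avg = np.mean([one_bit_compress(g) for g in workers], axis=0)
```

Note that the compressor is biased: coordinates where the workers disagree in sign partially cancel in the average, which is why such methods need the biased-gradient analyses discussed on this page.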
arXiv Detail & Related papers (2024-05-17T21:17:27Z) - Non-asymptotic Analysis of Biased Adaptive Stochastic Approximation [0.8192907805418583]
We show that biased stochastic gradient methods converge to critical points for smooth non-convex functions.
We show how the effect of bias can be reduced by appropriate tuning.
arXiv Detail & Related papers (2024-02-05T10:17:36Z) - Stochastic Gradient Descent for Gaussian Processes Done Right [86.83678041846971]
We show that when done right -- by which we mean using specific insights from the optimisation and kernel communities -- stochastic gradient descent is highly effective.
We introduce a stochastic dual descent algorithm, explain its design in an intuitive manner and illustrate the design choices.
Our method places Gaussian process regression on par with state-of-the-art graph neural networks for molecular binding affinity prediction.
arXiv Detail & Related papers (2023-10-31T16:15:13Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of
Stochasticity [24.428843425522107]
We study the dynamics of gradient descent over diagonal linear networks through its continuous time version, namely gradient flow.
We show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias.
arXiv Detail & Related papers (2021-06-17T14:16:04Z) - Deep learning: a statistical viewpoint [120.94133818355645]
Deep learning has revealed some major surprises from a theoretical perspective.
In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems.
We conjecture that specific principles underlie these phenomena.
arXiv Detail & Related papers (2021-03-16T16:26:36Z) - Gradient Descent for Deep Matrix Factorization: Dynamics and Implicit
Bias towards Low Rank [1.9350867959464846]
In deep learning, gradient descent tends to prefer solutions which generalize well.
In this paper we analyze the dynamics of gradient descent in the simplified setting of linear networks and of an estimation problem.
arXiv Detail & Related papers (2020-11-27T15:08:34Z) - On the Convergence of SGD with Biased Gradients [28.400751656818215]
We analyze the complexity of biased stochastic gradient methods (SGD), where individual updates are corrupted by compression.
We quantify how the magnitude of the bias impacts the attainable accuracy and convergence rates.
arXiv Detail & Related papers (2020-07-31T19:37:59Z) - A Study of Gradient Variance in Deep Learning [56.437755740715396]
We introduce a method, Gradient Clustering, to minimize the variance of average mini-batch gradient with stratified sampling.
We measure the gradient variance on common deep learning benchmarks and observe that, contrary to common assumptions, gradient variance increases during training.
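Measuring mini-batch gradient variance, as described above, amounts to comparing per-sample gradients against their mean. The sketch below does this for a toy linear regression; the model and data are illustrative assumptions, not the benchmark setup of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = X @ np.ones(5) + 0.1 * rng.normal(size=256)
w = np.zeros(5)

def per_sample_grads(w):
    """Per-sample gradients of the squared loss 0.5 * (x.w - y)^2."""
    r = X @ w - y          # residuals, shape (256,)
    return r[:, None] * X  # one gradient row per sample, shape (256, 5)

G = per_sample_grads(w)
full = G.mean(axis=0)  # full-batch gradient
# Gradient variance: mean squared deviation of per-sample grads from the mean.
var = np.mean(np.sum((G - full) ** 2, axis=1))
```

Tracking `var` across training iterations is one way to reproduce the kind of measurement the paper reports; stratified sampling then groups similar samples to reduce it.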
arXiv Detail & Related papers (2020-07-09T03:23:10Z) - Path Sample-Analytic Gradient Estimators for Stochastic Binary Networks [78.76880041670904]
In neural networks with binary activations and/or binary weights, training by gradient descent is complicated.
We propose a new method for this estimation problem combining sampling and analytic approximation steps.
We experimentally show higher accuracy in gradient estimation and demonstrate a more stable and better performing training in deep convolutional models.
arXiv Detail & Related papers (2020-06-04T21:51:21Z)
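To see why gradient estimation is needed at all for binary activations, consider the straight-through estimator, a common baseline (note: the paper above proposes a different, sample-analytic estimator; this sketch only illustrates the underlying problem):

```python
import numpy as np

def binary_forward(x, w):
    """Forward pass of one layer with a sign (binary) activation.
    sign() has zero derivative almost everywhere, so the true gradient
    w.r.t. w vanishes and is useless for training."""
    return np.sign(x @ w)

def ste_grad(x, w, upstream):
    """Straight-through estimator: pretend d sign(z)/dz = 1 on the
    backward pass, giving a biased but usable gradient w.r.t. w."""
    return x.T @ upstream

x = np.array([[1.0, -2.0]])
w = np.array([[0.5], [0.3]])
out = binary_forward(x, w)           # sign(0.5 - 0.6) = -1
g = ste_grad(x, w, np.array([[1.0]]))
```

Sampling-plus-analytic-approximation methods like the one summarized above aim to reduce the bias of such surrogate gradients.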
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and accepts no responsibility for any consequences of its use.