Federated Stochastic Gradient Langevin Dynamics
- URL: http://arxiv.org/abs/2004.11231v3
- Date: Mon, 14 Jun 2021 23:50:47 GMT
- Title: Federated Stochastic Gradient Langevin Dynamics
- Authors: Khaoula El Mekkaoui, Diego Mesquita, Paul Blomstedt, Samuel Kaski
- Abstract summary: Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling.
We propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates.
We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails.
- Score: 12.180900849847252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stochastic gradient MCMC methods, such as stochastic gradient Langevin
dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale
posterior sampling. Although we can easily extend SGLD to distributed settings,
it suffers from two issues when applied to federated non-IID data. First, the
variance of these estimates increases significantly. Second, delaying
communication causes the Markov chains to diverge from the true posterior even
for very simple models. To alleviate both these problems, we propose conducive
gradients, a simple mechanism that combines local likelihood approximations to
correct gradient updates. Notably, conducive gradients are easy to compute, and
since we only calculate the approximations once, they incur negligible
overhead. We apply conducive gradients to distributed stochastic gradient
Langevin dynamics (DSGLD) and call the resulting method federated stochastic
gradient Langevin dynamics (FSGLD). We demonstrate that our approach can handle
delayed communication rounds, converging to the target posterior in cases where
DSGLD fails. We also show that FSGLD outperforms DSGLD for non-IID federated
data with experiments on metric learning and neural networks.
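To make the mechanism concrete, below is a minimal, self-contained Python sketch on a toy Gaussian-mean model with data sorted into non-IID shards. Each client fits a Gaussian surrogate to its local likelihood once and, during local SGLD steps, adds the gradients of the other clients' surrogates to its own minibatch gradient estimate. This is only one plausible way to combine local likelihood approximations: the function names, the Gaussian surrogates, and the exact form of the correction term are illustrative assumptions, not the paper's precise conducive-gradient formula.

# Sketch of SGLD with a conducive-gradient-style correction on a toy model.
# Model: x_i ~ N(theta, sigma2) with known sigma2, prior theta ~ N(0, prior_var).
import numpy as np

rng = np.random.default_rng(0)
sigma2, prior_var, true_theta = 1.0, 10.0, 2.0

# Generate data and sort it before splitting so the shards are non-IID.
data = rng.normal(true_theta, np.sqrt(sigma2), size=600)
data.sort()
shards = np.array_split(data, 3)

def grad_log_prior(theta):
    # d/dtheta log N(theta | 0, prior_var)
    return -theta / prior_var

def grad_log_lik(theta, x):
    # d/dtheta sum_i log N(x_i | theta, sigma2)
    return np.sum(x - theta) / sigma2

# Each client computes a Gaussian surrogate of its local likelihood ONCE.
# For this conjugate toy model the surrogate is exact: as a function of theta,
# prod_i N(x_i | theta, sigma2) is proportional to N(theta | mean(x_s), sigma2/N_s).
surrogates = [(float(np.mean(x)), sigma2 / len(x)) for x in shards]

def grad_log_surrogate(theta, surrogate):
    m, v = surrogate
    return (m - theta) / v

def local_sgld_round(theta, shard_id, n_steps=50, batch_size=10, step=1e-3):
    """Local SGLD steps at one client; the other shards' likelihood terms are
    approximated by the gradients of their pre-computed surrogates."""
    x = shards[shard_id]
    N_s = len(x)
    for _ in range(n_steps):
        batch = rng.choice(x, size=batch_size, replace=False)
        # unbiased estimate of the LOCAL shard's likelihood gradient
        local = (N_s / batch_size) * grad_log_lik(theta, batch)
        # correction: gradients of the OTHER clients' surrogates
        remote = sum(grad_log_surrogate(theta, surrogates[s])
                     for s in range(len(shards)) if s != shard_id)
        grad = grad_log_prior(theta) + local + remote
        # SGLD update: theta <- theta + (step/2) * grad + N(0, step) noise
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.normal()
    return theta

# Round-robin over clients; many local steps per round mimic delayed communication.
theta = 0.0
for round_id in range(30):
    theta = local_sgld_round(theta, shard_id=round_id % len(shards))
print("final sample:", theta,
      "| exact posterior mean:", np.sum(data) / (len(data) + sigma2 / prior_var))

Because the toy model is conjugate, the surrogates happen to be exact here; for non-conjugate models they would be approximations (for example, Laplace or variational fits) computed once per client, as the abstract notes.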
Related papers
- Emergence of heavy tails in homogenized stochastic gradient descent [1.450405446885067]
Loss minimization by stochastic gradient descent (SGD) leads to heavy-tailed network parameters.
We analyze a continuous diffusion approximation of SGD, called homogenized stochastic gradient descent.
We quantify the interplay between optimization parameters and the tail-index.
arXiv Detail & Related papers (2024-02-02T13:06:33Z)
- Convergence of mean-field Langevin dynamics: Time and space discretization, stochastic gradient, and variance reduction [49.66486092259376]
The mean-field Langevin dynamics (MFLD) is a nonlinear generalization of the Langevin dynamics that incorporates a distribution-dependent drift.
Recent works have shown that MFLD globally minimizes an entropy-regularized convex functional in the space of measures.
We provide a framework to prove a uniform-in-time propagation of chaos for MFLD that takes into account the errors due to finite-particle approximation, time-discretization, and gradient approximation.
arXiv Detail & Related papers (2023-06-12T16:28:11Z)
- Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective in solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z)
- Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias of homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the variance of the random initialization is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z)
- Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks [22.823904789355495]
We investigate the connection between the mean-field hydrodynamic regime and the seminal approach of Saad & Solla.
Our work builds on a deterministic description of SGD dynamics in high dimensions from statistical physics.
arXiv Detail & Related papers (2022-02-01T09:45:07Z)
- Vanishing Curvature and the Power of Adaptive Methods in Randomly Initialized Deep Networks [30.467121747150816]
This paper revisits the so-called vanishing gradient phenomenon, which commonly occurs in deep, randomly initialized neural networks.
We first show that vanishing gradients cannot be circumvented when the network width scales with less than O(depth).
arXiv Detail & Related papers (2021-06-07T16:29:59Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Faster Convergence of Stochastic Gradient Langevin Dynamics for Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z)
- The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent [28.148743710421932]
The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models.
We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks.
arXiv Detail & Related papers (2020-04-27T20:06:11Z)
- LASG: Lazily Aggregated Stochastic Gradients for Communication-Efficient Distributed Learning [47.93365664380274]
This paper targets solving distributed machine learning problems such as federated learning in a communication-efficient fashion.
A class of new stochastic gradient descent (SGD) approaches has been developed, which can be viewed as a generalization of the recently developed lazily aggregated gradient (LAG) method.
The key components of LASG are a set of new rules tailored for gradients that can be implemented either to save download, upload, or both.
arXiv Detail & Related papers (2020-02-26T08:58:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.