Federated Learning Using Variance Reduced Stochastic Gradient for
Probabilistically Activated Agents
- URL: http://arxiv.org/abs/2210.14362v2
- Date: Sat, 1 Apr 2023 20:13:41 GMT
- Title: Federated Learning Using Variance Reduced Stochastic Gradient for
Probabilistically Activated Agents
- Authors: M. R. Rostami, S. S. Kia
- Abstract summary: This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an algorithm for Federated Learning (FL) with a two-layer
structure that achieves both variance reduction and a faster convergence rate
to an optimal solution in the setting where each agent has an arbitrary
probability of selection in each iteration. In distributed machine learning,
FL is a useful tool when privacy matters. When FL is deployed in an
environment with irregular connections among agents (devices), reaching a
trained model both economically and quickly can be demanding. The first
layer of our algorithm corresponds to the model parameter propagation across
agents done by the server. In the second layer, each agent does its local
update with a stochastic and variance-reduced technique called Stochastic
Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction
from stochastic optimization when the agents want to do their local update step
to reduce the variance caused by stochastic gradient descent (SGD). We provide
a convergence bound for our algorithm which improves the rate from
$O(\frac{1}{\sqrt{K}})$ to $O(\frac{1}{K})$ by using a constant step-size. We
demonstrate the performance of our algorithm using numerical examples.
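To make the two-layer structure concrete, the following is a minimal Python sketch, not the authors' implementation: the server-side layer activates each agent with its own probability and averages whatever comes back, while the agent-side layer runs SVRG-style local updates with a constant step size. The quadratic loss, the step size, the activation probabilities, and the plain-averaging aggregation rule are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def grad(w, x, y):
    # Per-sample gradient of an illustrative squared loss 0.5 * (x @ w - y)**2.
    return x * (x @ w - y)

def full_grad(w, X, Y):
    # Full local gradient, used as the SVRG anchor.
    return np.mean([grad(w, xi, yi) for xi, yi in zip(X, Y)], axis=0)

def local_svrg(w, X, Y, step=0.1, inner_steps=20):
    # Second layer: one SVRG pass. Each inner step corrects the stochastic
    # gradient with the difference to the anchor gradient, reducing variance.
    mu = full_grad(w, X, Y)
    v = w.copy()
    for _ in range(inner_steps):
        i = rng.integers(len(X))
        v = v - step * (grad(v, X[i], Y[i]) - grad(w, X[i], Y[i]) + mu)
    return v

def federated_round(w, agents, p):
    # First layer: the server broadcasts w, each agent k participates with
    # probability p[k], and the server averages the returned local models.
    returned = [local_svrg(w, X, Y)
                for k, (X, Y) in enumerate(agents) if rng.random() < p[k]]
    return np.mean(returned, axis=0) if returned else w

# Toy usage: a 5-dimensional least-squares problem split across three agents
# with different (made-up) activation probabilities.
d = 5
agents = [(rng.normal(size=(40, d)), rng.normal(size=40)) for _ in range(3)]
p = [0.9, 0.5, 0.2]
w = np.zeros(d)
for _ in range(50):
    w = federated_round(w, agents, p)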
Related papers
- Towards Hyper-parameter-free Federated Learning [1.3682156035049038]
We introduce algorithms for automated scaling of global model updates.
In the first algorithm, we establish that a descent-ensuring step-size regime at the clients ensures descent for the server objective.
The second algorithm shows that the average of the objective values of sampled clients is a practical and effective substitute for the value the server requires to compute the scaling factor.
arXiv Detail & Related papers (2024-08-30T09:35:36Z)
- DASA: Delay-Adaptive Multi-Agent Stochastic Approximation [64.32538247395627]
We consider a setting in which $N$ agents aim to speed up a common Stochastic Approximation problem by acting in parallel and communicating with a central server.
To mitigate the effect of delays and stragglers, we propose DASA, a Delay-Adaptive algorithm for multi-agent Stochastic Approximation.
arXiv Detail & Related papers (2024-03-25T22:49:56Z)
- Rate-Distortion Theoretic Bounds on Generalization Error for Distributed Learning [9.00236182523638]
In this paper, we use tools from rate-distortion theory to establish new upper bounds on the generalization error of statistical distributed learning algorithms.
The bounds depend on the compressibility of each client's algorithm while keeping the other clients' algorithms uncompressed.
arXiv Detail & Related papers (2022-06-06T13:21:52Z)
- An Improved Analysis of Gradient Tracking for Decentralized Machine Learning [34.144764431505486]
We consider decentralized machine learning over a network where the training data is distributed across $n$ agents.
The agents' common goal is to find a model that minimizes the average of all local loss functions.
We improve the dependency on $p$ from $\mathcal{O}(p^{-1})$ to $\mathcal{O}(p^{-1})$ in the noiseless case.
arXiv Detail & Related papers (2022-02-08T12:58:14Z)
- Decentralized Stochastic Proximal Gradient Descent with Variance Reduction over Time-varying Networks [30.231314171218994]
In decentralized learning, a network of nodes cooperate to minimize an overall objective function that is usually the finite-sum of their local objectives.
We propose a novel algorithm, namely DPSVRG, to accelerate the decentralized training by leveraging the variance reduction technique.
arXiv Detail & Related papers (2021-12-20T08:23:36Z)
- Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z)
- Byzantine-Resilient Non-Convex Stochastic Gradient Descent [61.6382287971982]
We study adversary-resilient distributed optimization, in which machines can independently compute gradients and cooperate to jointly optimize over their local objectives.
Our algorithm is based on a new concentration technique, and its sample complexity matches the best known bounds in this setting.
It is very practical: it improves upon the performance of all prior methods when no Byzantine machines are present.
arXiv Detail & Related papers (2020-12-28T17:19:32Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster in finite time, up to an error.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)
- Private Stochastic Non-Convex Optimization: Adaptive Algorithms and Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate the empirical advantages of our methods over standard gradient methods on two popular deep learning models.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)
- Adaptive Distributed Stochastic Gradient Descent for Minimizing Delay in the Presence of Stragglers [31.309253646602933]
We consider the setting where a master wants to run a distributed stochastic gradient descent (SGD) algorithm on $n$ workers, each having a subset of the data.
Distributed SGD may suffer from the effect of stragglers, i.e., slow or unresponsive workers who cause delays.
We propose an algorithm for adaptive distributed SGD that is based on a statistical notion.
arXiv Detail & Related papers (2020-02-25T16:25:22Z)
- Variance Reduction with Sparse Gradients [82.41780420431205]
Variance reduction methods such as SVRG and SpiderBoost use a mixture of large and small batch gradients.
We introduce a new sparsity operator: the random-top-k operator.
Our algorithm consistently outperforms SpiderBoost on various tasks including image classification, natural language processing, and sparse matrix factorization.
arXiv Detail & Related papers (2020-01-27T08:23:58Z)
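The random-top-k operator named in the last entry lends itself to a short sketch. The exact definition used in that paper may differ; the version below, including the even split between magnitude-selected and randomly selected coordinates and the absence of rescaling, is an assumption meant only to illustrate a gradient sparsity operator of this kind.

import numpy as np

rng = np.random.default_rng(0)

def random_top_k(g, k, top_fraction=0.5):
    # Keep the largest-magnitude coordinates for part of the budget and fill
    # the remainder with uniformly random coordinates from the rest.
    k_top = int(k * top_fraction)
    order = np.argsort(np.abs(g))[::-1]      # coordinates sorted by magnitude
    top_idx = order[:k_top]
    rest = order[k_top:]
    rand_idx = rng.choice(rest, size=min(k - k_top, rest.size), replace=False)
    out = np.zeros_like(g)
    keep = np.concatenate([top_idx, rand_idx])
    out[keep] = g[keep]
    return out

# Example: sparsify a dense 100-dimensional gradient down to 10 coordinates.
g = rng.normal(size=100)
print(np.count_nonzero(random_top_k(g, k=10)))   # prints 10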