Byzantine-Resilient SGD in High Dimensions on Heterogeneous Data
- URL: http://arxiv.org/abs/2005.07866v1
- Date: Sat, 16 May 2020 04:15:27 GMT
- Title: Byzantine-Resilient SGD in High Dimensions on Heterogeneous Data
- Authors: Deepesh Data and Suhas Diggavi
- Abstract summary: We study distributed stochastic gradient descent (SGD) in the master-worker architecture under Byzantine attacks.
Our algorithm can tolerate up to a $\frac{1}{4}$ fraction of Byzantine workers.
- Score: 10.965065178451104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study distributed stochastic gradient descent (SGD) in the master-worker
architecture under Byzantine attacks. We consider the heterogeneous data model,
where different workers may have different local datasets, and we do not make
any probabilistic assumptions on data generation. At the core of our algorithm,
we use the polynomial-time outlier-filtering procedure for robust mean
estimation proposed by Steinhardt et al. (ITCS 2018) to filter out corrupt
gradients. In order to be able to apply their filtering procedure in our {\em
heterogeneous} data setting where workers compute {\em stochastic} gradients,
we derive a new matrix concentration result, which may be of independent
interest.
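As a rough illustration of this filtering step, the sketch below mimics the spirit of spectral outlier filtering rather than the paper's exact procedure; the function name, the `var_bound` threshold, and the down-weighting rule are our assumptions.

```python
import numpy as np

def filtered_gradient_mean(grads, var_bound=1.0, max_iter=50):
    """Hedged sketch of outlier-filtered mean estimation over worker gradients.

    grads:     (m, d) array, one stochastic gradient per worker.
    var_bound: assumed bound on the variance of honest gradients along any
               direction; a stand-in for the quantities the paper controls
               via its matrix concentration result.
    """
    grads = np.asarray(grads, dtype=float)
    weights = np.ones(grads.shape[0])
    for _ in range(max_iter):
        w = weights / weights.sum()
        mu = w @ grads                                    # current weighted mean
        centered = grads - mu
        # top direction of the weighted empirical covariance
        _, svals, vt = np.linalg.svd(centered * np.sqrt(w)[:, None],
                                     full_matrices=False)
        if svals[0] ** 2 <= 4.0 * var_bound:              # heuristic acceptance test
            return mu
        scores = (centered @ vt[0]) ** 2                  # deviation along top direction
        weights = weights * (1.0 - scores / scores.max()) # down-weight worst offenders
        if weights.sum() <= 1e-12:                        # degenerate-case guard
            return mu
    return (weights / weights.sum()) @ grads
```

At the master, one SGD step would then use `filtered_gradient_mean` in place of a plain average of the received gradients; in the paper, the admissible thresholds and guarantees come from the matrix concentration result mentioned above.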
We provide convergence analyses for smooth strongly-convex and non-convex
objectives. We derive our results under the bounded variance assumption on
local stochastic gradients and a {\em deterministic} condition on datasets,
namely, gradient dissimilarity; and for both these quantities, we provide
concrete bounds in the statistical heterogeneous data model. We give a
trade-off between the mini-batch size for stochastic gradients and the
approximation error. Our algorithm can tolerate up to a $\frac{1}{4}$ fraction of
Byzantine workers. It can find approximately optimal parameters in the
strongly-convex setting exponentially fast and reach an approximate
stationary point in the non-convex setting at a linear speed, thus matching
the convergence rates of vanilla SGD in the Byzantine-free setting.
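Written out in illustrative notation (the paper's exact definitions and constants may differ), the two conditions above say that each honest worker $r$ returns a stochastic gradient $g_r(x)$ of its local objective $F_r$ with bounded variance, and that local gradients stay close to the global one:
$$\mathbb{E}\,\big\|g_r(x)-\nabla F_r(x)\big\|^2 \le \sigma^2, \qquad \max_r\,\big\|\nabla F_r(x)-\nabla F(x)\big\|^2 \le \kappa^2, \qquad F(x)=\tfrac{1}{m}\sum_{r=1}^{m}F_r(x).$$
Roughly, larger mini-batches shrink $\sigma^2$ (the trade-off mentioned above), while the dissimilarity bound $\kappa^2$ is a property of the data split and feeds into the approximation error.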
We also propose and analyze a Byzantine-resilient SGD algorithm with gradient
compression, where workers send $k$ random coordinates of their gradients.
Under mild conditions, we show a $\frac{d}{k}$-factor saving in communication
bits as well as decoding complexity over our compression-free algorithm without
affecting its convergence rate (order-wise) and the approximation error.
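For intuition, here is a minimal sketch of a rand-$k$ style compressor matching the description above; the helper names, the $\frac{d}{k}$ rescaling for unbiasedness, and the toy round are our assumptions rather than the paper's exact scheme.

```python
import numpy as np

def rand_k_compress(grad, k, rng):
    """Worker side: send only k randomly chosen coordinates of a d-dim gradient.

    The d/k rescaling keeps the sparsified vector an unbiased estimate
    of the full gradient (an assumption of this sketch).
    """
    d = grad.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    return idx, grad[idx] * (d / k)

def rand_k_decompress(idx, vals, d):
    """Master side: rebuild a dense vector from the k received coordinates."""
    out = np.zeros(d)
    out[idx] = vals
    return out

# Toy round: each of m workers ships k index/value pairs instead of d floats,
# i.e. roughly a d/k saving in communication per worker.
rng = np.random.default_rng(0)
d, k, m = 1000, 50, 8
local_grads = [rng.normal(size=d) for _ in range(m)]
received = [rand_k_decompress(*rand_k_compress(g, k, rng), d) for g in local_grads]
# The master would then run its outlier filter over `received` before averaging.
```

Each worker thus transmits $k$ index-value pairs instead of $d$ coordinates, which is where the $\frac{d}{k}$-factor saving in communication bits (and in the master's decoding work) comes from.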
Related papers
- Byzantine-resilient Federated Learning With Adaptivity to Data Heterogeneity [54.145730036889496]
This paper deals with Federated Learning (FL) in the presence of malicious Byzantine attacks.
A novel Robust Average Gradient Algorithm (RAGA) is proposed, which leverages robust aggregation and can freely select the number of local update rounds.
arXiv Detail & Related papers (2024-03-20T08:15:08Z) - Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems [56.86067111855056]
We consider clipped stochastic optimization methods for problems with heavy-tailed noise with structured density.
We show that it is possible to get faster rates of convergence than $\mathcal{O}(K^{-(\alpha-1)/\alpha})$ when the gradients have finite moments of order $\alpha$.
We prove that the resulting estimates have negligible bias and controllable variance.
arXiv Detail & Related papers (2023-11-07T17:39:17Z) - Robust Stochastic Optimization via Gradient Quantile Clipping [6.2844649973308835]
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD).
We use quantiles of the gradient norm as clipping thresholds, making the method robust to gradient outliers.
We propose an implementation of the algorithm using Huber-type quantile estimators.
arXiv Detail & Related papers (2023-09-29T15:24:48Z) - Federated Learning Using Variance Reduced Stochastic Gradient for
Probabilistically Activated Agents [0.0]
This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration.
arXiv Detail & Related papers (2022-10-25T22:04:49Z) - Convergence of First-Order Methods for Constrained Nonconvex
Optimization with Dependent Data [7.513100214864646]
We show a worst-case convergence rate of $\tilde{O}(t^{-1/4})$ and, in terms of the Moreau envelope, a complexity of $\tilde{O}(\varepsilon^{-4})$ for smooth nonconvex optimization.
We obtain the first online nonnegative matrix factorization algorithms for dependent data, based on projected gradient methods with adaptive step sizes and optimal convergence.
arXiv Detail & Related papers (2022-03-29T17:59:10Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Faster Convergence of Stochastic Gradient Langevin Dynamics for
Non-Log-Concave Sampling [110.88857917726276]
We provide a new convergence analysis of stochastic gradient Langevin dynamics (SGLD) for sampling from a class of distributions that can be non-log-concave.
At the core of our approach is a novel conductance analysis of SGLD using an auxiliary time-reversible Markov Chain.
arXiv Detail & Related papers (2020-10-19T15:23:18Z) - Byzantine-Resilient High-Dimensional Federated Learning [10.965065178451104]
We employ stochastic gradient descent (SGD) with local iterations in the presence of malicious/Byzantine clients.
Clients update their local models by taking several SGD iterations on their own datasets and then communicate with the master.
We believe that ours is the first Byzantine-resilient algorithm and analysis with local iterations.
arXiv Detail & Related papers (2020-06-22T03:54:36Z) - Carath\'eodory Sampling for Stochastic Gradient Descent [79.55586575988292]
We present an approach that is inspired by classical results of Tchakaloff and Carath\'eodory about measure reduction.
We adaptively select the descent steps where the measure reduction is carried out.
We combine this with Block Coordinate Descent so that measure reduction can be done very cheaply.
arXiv Detail & Related papers (2020-06-02T17:52:59Z) - Improved guarantees and a multiple-descent curve for Column Subset
Selection and the Nystr\"om method [76.73096213472897]
We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees.
Our approach leads to significantly better bounds for datasets with known rates of singular value decay.
We show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
arXiv Detail & Related papers (2020-02-21T00:43:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.