rTop-k: A Statistical Estimation Approach to Distributed SGD
- URL: http://arxiv.org/abs/2005.10761v2
- Date: Wed, 2 Dec 2020 21:26:06 GMT
- Title: rTop-k: A Statistical Estimation Approach to Distributed SGD
- Authors: Leighton Pate Barnes, Huseyin A. Inan, Berivan Isik, and Ayfer Ozgur
- Abstract summary: We propose a simple statistical estimation model for the stochastic gradients which captures the sparsity and skewness of their distribution, and derive the statistically optimal communication scheme for it.
This scheme yields a new sparsification technique for SGD that concatenates random-k and top-k.
We show through extensive experiments on both image and language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the concatenated application of these two sparsification methods consistently and significantly outperforms either method applied alone.
- Score: 5.197307534263253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The large communication cost for exchanging gradients between different nodes
significantly limits the scalability of distributed training for large-scale
learning models. Motivated by this observation, there has been significant
recent interest in techniques that reduce the communication cost of distributed
Stochastic Gradient Descent (SGD), with gradient sparsification techniques such
as top-k and random-k shown to be particularly effective. The same observation
has also motivated a separate line of work in distributed statistical
estimation theory focusing on the impact of communication constraints on the
estimation efficiency of different statistical models. The primary goal of this
paper is to connect these two research lines and demonstrate how statistical
estimation models and their analysis can lead to new insights in the design of
communication-efficient training techniques. We propose a simple statistical
estimation model for the stochastic gradients which captures the sparsity and
skewness of their distribution. The statistically optimal communication scheme
arising from the analysis of this model leads to a new sparsification technique
for SGD, which concatenates random-k and top-k, considered separately in the
prior literature. We show through extensive experiments on both image and
language domains with CIFAR-10, ImageNet, and Penn Treebank datasets that the
concatenated application of these two sparsification methods consistently and
significantly outperforms either method applied alone.
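For concreteness, below is a minimal NumPy sketch of the two baseline sparsifiers the abstract mentions and one plausible reading of the concatenated scheme; the function names, the budget parameter K, and the ordering of the two stages are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch (not the authors' code) of the sparsifiers discussed in the abstract.
# The concatenated "rTop-k"-style step below is one plausible reading of
# "concatenating random-k and top-k"; the paper specifies the exact scheme
# and how the budgets K and k are chosen.
import numpy as np

def top_k(grad: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude coordinates, zero the rest."""
    out = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    out[idx] = grad[idx]
    return out

def random_k(grad: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Keep k coordinates chosen uniformly at random, zero the rest."""
    out = np.zeros_like(grad)
    idx = rng.choice(grad.size, size=k, replace=False)
    out[idx] = grad[idx]
    return out

def r_top_k(grad: np.ndarray, K: int, k: int, rng: np.random.Generator) -> np.ndarray:
    """Concatenated scheme (assumed order): restrict to the top-K coordinates
    by magnitude, then transmit k of them chosen uniformly at random."""
    out = np.zeros_like(grad)
    top_idx = np.argpartition(np.abs(grad), -K)[-K:]   # top-K stage
    keep = rng.choice(top_idx, size=k, replace=False)  # random-k stage
    out[keep] = grad[keep]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal(1_000)
    print(np.count_nonzero(r_top_k(g, K=100, k=10, rng=rng)))  # -> 10 nonzero entries
```

In a distributed SGD round, each worker would apply such a sparsifier to its local gradient and communicate only the surviving indices and values before aggregation.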
Related papers
- On conditional diffusion models for PDE simulations [53.01911265639582]
We study score-based diffusion models for forecasting and assimilation of sparse observations.
We propose an autoregressive sampling approach that significantly improves performance in forecasting.
We also propose a new training strategy for conditional score-based models that achieves stable performance over a range of history lengths.
arXiv Detail & Related papers (2024-10-21T18:31:04Z)
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose Meet-In-The-Middle based MITA, which introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- Accelerated Stochastic ExtraGradient: Mixing Hessian and Gradient Similarity to Reduce Communication in Distributed and Federated Learning [50.382793324572845]
Distributed computing involves communication between devices, which requires solving two key problems: efficiency and privacy.
In this paper, we analyze a new method that incorporates the ideas of using data similarity and clients sampling.
To address privacy concerns, we apply the technique of additional noise and analyze its impact on the convergence of the proposed method.
arXiv Detail & Related papers (2024-09-22T00:49:10Z)
- Reducing Spurious Correlation for Federated Domain Generalization [15.864230656989854]
In open-world scenarios, global models may struggle to predict well on entirely new domain data captured by certain media.
Existing methods still rely on strong statistical correlations between samples and labels to address this issue.
We introduce FedCD, an overall optimization framework at both the local and global levels.
arXiv Detail & Related papers (2024-07-27T05:06:31Z)
- Aggregation Weighting of Federated Learning via Generalization Bound Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions.
We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
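As context for that entry, the sketch below shows the standard sample-proportion aggregation it refers to (FedAvg-style weighting); it does not implement the generalization-bound-based weights, which are defined in the cited paper.

```python
# Standard sample-proportion aggregation (FedAvg-style), i.e. the baseline
# weighting the entry above says is replaced; the generalization-bound
# weighting itself is defined in the cited paper and is not reproduced here.
import numpy as np

def aggregate_by_sample_proportion(client_params, client_sizes):
    """Weighted average of client parameter vectors with weights n_i / sum(n)."""
    sizes = np.asarray(client_sizes, dtype=float)
    weights = sizes / sizes.sum()
    stacked = np.stack(client_params)          # shape: (num_clients, dim)
    return np.tensordot(weights, stacked, axes=1)

# Example: three clients holding different amounts of local data.
params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(aggregate_by_sample_proportion(params, client_sizes=[10, 30, 60]))  # -> [0.7 0.9]
```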
arXiv Detail & Related papers (2023-11-10T08:50:28Z)
- A Bayesian Methodology for Estimation for Sparse Canonical Correlation [0.0]
Canonical Correlation Analysis (CCA) is a statistical procedure for identifying relationships between data sets.
ScSCCA is a rapidly emerging methodological area that aims for robust modeling of the interrelations between the different data modalities.
We propose a novel ScSCCA approach where we employ a Bayesian infinite factor model and aim to achieve robust estimation.
arXiv Detail & Related papers (2023-10-30T15:14:25Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Depersonalized Federated Learning: Tackling Statistical Heterogeneity by Alternating Stochastic Gradient Descent [6.394263208820851]
Federated learning (FL) enables devices to train a common machine learning (ML) model for intelligent inference without data sharing.
Raw data held by the various cooperating participants are typically non-identically distributed.
We propose a new FL scheme that addresses this statistical heterogeneity via alternating stochastic gradient descent.
arXiv Detail & Related papers (2022-10-07T10:30:39Z)
- Fundamental Limits of Communication Efficiency for Model Aggregation in Distributed Learning: A Rate-Distortion Approach [54.311495894129585]
We study the limit of communication cost of model aggregation in distributed learning from a rate-distortion perspective.
It is found that the communication gain by exploiting the correlation between worker nodes is significant for SignSGD.
arXiv Detail & Related papers (2022-06-28T13:10:40Z)
- Achieving Efficiency in Black Box Simulation of Distribution Tails with Self-structuring Importance Samplers [1.6114012813668934]
The paper presents a novel Importance Sampling (IS) scheme for estimating distribution tails of performance measures modeled with a rich set of tools such as linear programs, integer linear programs, piecewise linear/quadratic objectives, feature maps specified with deep neural networks, etc.
arXiv Detail & Related papers (2021-02-14T03:37:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.