Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD
- URL: http://arxiv.org/abs/2203.06638v1
- Date: Sun, 13 Mar 2022 11:52:24 GMT
- Title: Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD
- Authors: Bapi Chatterjee and Vyacheslav Kungurtsev and Dan Alistarh
- Abstract summary: Hogwild! is a go-to approach to parallelizing SGD in a shared-memory setting.
In this paper, we propose incorporating decentralized distributed memory, with each node itself running parallel shared-memory SGD.
- Score: 29.6870062491741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Powered by the simplicity of lock-free asynchrony, Hogwild! is a go-to
approach to parallelizing SGD in a shared-memory setting. Despite its
popularity and concomitant extensions, such as PASSM+ wherein concurrent
processes update a shared model with partitioned gradients, scaling it to
decentralized workers has surprisingly been relatively unexplored. To our
knowledge, there is no convergence theory of such methods, nor systematic
numerical comparisons evaluating speed-up.
In this paper, we propose an algorithm that incorporates a decentralized
distributed-memory computing architecture, with each node itself running
multiprocess parallel shared-memory SGD. Our scheme is based on the
following algorithmic tools and features: (a) asynchronous local gradient
updates on the shared-memory of workers, (b) partial backpropagation, and (c)
non-blocking in-place averaging of the local models. We prove that our method
guarantees ergodic convergence rates for non-convex objectives. On the
practical side, we show that the proposed method exhibits improved throughput
and competitive accuracy for standard image classification benchmarks on the
CIFAR-10, CIFAR-100, and ImageNet datasets. Our code is available at
https://github.com/bapi/LPP-SGD.
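The three ingredients above can be illustrated with a minimal, self-contained sketch. This is a hypothetical toy simulation (a least-squares objective, threads standing in for nodes and workers), not the released LPP-SGD code: each simulated node runs several Hogwild!-style worker threads that update a shared local model without locks, each update touches only a block of coordinates (a crude stand-in for partial backpropagation), and nodes occasionally mix their local model with a neighbour's in place, without any barrier.

```python
# Minimal, hypothetical sketch of decentralized Hogwild!-style SGD on a toy
# least-squares problem. Threads stand in for nodes/workers; this is NOT the
# authors' LPP-SGD implementation.
import threading
import numpy as np

d, n_nodes, workers_per_node, steps = 32, 4, 4, 2000
rng0 = np.random.default_rng(0)
A = rng0.normal(size=(512, d))
x_true = rng0.normal(size=d)
b = A @ x_true

# One shared model per node; its worker threads update it without locks.
models = [np.zeros(d) for _ in range(n_nodes)]

def worker(node, seed, lr=1e-3, block=8):
    rng = np.random.default_rng(seed)
    x = models[node]                                    # shared, updated in place
    for t in range(steps):
        i = rng.integers(len(A))                        # sample one data point
        grad = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient
        idx = rng.choice(d, size=block, replace=False)  # block update: stand-in for partial backprop
        x[idx] -= lr * grad[idx]                        # lock-free in-place write (Hogwild!)
        if t % 100 == 0:                                # occasional non-blocking averaging:
            peer = (node + 1) % n_nodes                 # pull a neighbour's current model
            x += 0.5 * (models[peer] - x)               # in-place mix, no barrier

threads = [threading.Thread(target=worker, args=(node, 100 * node + w))
           for node in range(n_nodes) for w in range(workers_per_node)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print("mean distance to x_true:",
      np.mean([np.linalg.norm(m - x_true) for m in models]))
```

The averaging here is a plain pairwise mix on a ring; the actual averaging schedule, the partial-backpropagation rule, and the step sizes that the convergence analysis requires are specified in the paper and the linked repository.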
Related papers
- Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
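A minimal, hypothetical sketch of the tiling idea in "Breaking the Memory Barrier" above, assuming an InfoNCE-style loss with positives on the diagonal: the N x N similarity matrix is consumed in column blocks with a streaming log-sum-exp, so peak memory is O(N * block) rather than O(N^2). This is an illustration, not that paper's implementation.

```python
# Hypothetical sketch of tile-wise contrastive (InfoNCE-style) loss: the N x N
# similarity matrix is processed in column blocks with a streaming log-sum-exp.
import numpy as np

def tiled_infonce(q, k, tau=0.07, block=256):
    n = q.shape[0]
    run_max = np.full(n, -np.inf)          # running row-wise max (numerical stability)
    run_sum = np.zeros(n)                  # running row-wise sum of exp(s - run_max)
    pos = np.empty(n)                      # similarity of each query with its positive
    for start in range(0, n, block):
        s = q @ k[start:start + block].T / tau           # (n, block) tile of similarities
        new_max = np.maximum(run_max, s.max(axis=1))
        run_sum = run_sum * np.exp(run_max - new_max) \
                  + np.exp(s - new_max[:, None]).sum(axis=1)
        run_max = new_max
        rows = np.arange(start, min(start + block, n))   # diagonal entries in this tile
        pos[rows] = s[rows, rows - start]
    return np.mean(run_max + np.log(run_sum) - pos)      # -mean log-softmax of positives

rng = np.random.default_rng(1)
q = rng.normal(size=(1024, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
k = q + 0.1 * rng.normal(size=q.shape)
k /= np.linalg.norm(k, axis=1, keepdims=True)
print(tiled_infonce(q, k))
```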
- Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates [1.9241821314180372]
One major shortcoming of backpropagation is the interlocking between the forward and backward phases of the algorithm.
We propose a method that parallelises SGD updates across the layers of a model by asynchronously updating them from multiple threads.
We show that this approach yields close to state-of-the-art results while running up to 2.97x faster than Hogwild! scaled on multiple devices.
arXiv Detail & Related papers (2024-10-08T12:32:36Z)
- AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms [45.90015262911875]
We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting.
As a by-product of our analysis, we also demonstrate guarantees for gradient-type algorithms such as SGD with random reshuffling.
arXiv Detail & Related papers (2023-10-31T13:44:53Z)
- ByzSecAgg: A Byzantine-Resistant Secure Aggregation Scheme for Federated Learning Based on Coded Computing and Vector Commitment [90.60126724503662]
ByzSecAgg is an efficient secure aggregation scheme for federated learning.
ByzSecAgg protects against Byzantine attacks and privacy leakage.
arXiv Detail & Related papers (2023-02-20T11:15:18Z)
- Recall@k Surrogate Loss with Large Batches and Similarity Mixup [62.67458021725227]
Direct optimization, by gradient descent, of an evaluation metric is not possible when it is non-differentiable.
In this work, a differentiable surrogate loss for the recall is proposed.
The proposed method achieves state-of-the-art results in several image retrieval benchmarks.
arXiv Detail & Related papers (2021-08-25T11:09:11Z)
- Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely parallelized by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z)
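The coding idea behind the gradient-coding entry above can be made concrete with the classic static 3-worker construction; this is the generic, hypothetical example with fixed redundancy, not the dynamic clustering scheme itself. Each worker sends one coded combination of its two local partial gradients, and the master recovers the full gradient sum from any two responders.

```python
# Hypothetical sketch of (static) gradient coding tolerating one straggler:
# 3 workers, 3 data partitions, each worker holds 2 partitions and sends one
# coded combination; the master recovers g1 + g2 + g3 from any 2 workers.
import numpy as np

rng = np.random.default_rng(0)
g = [rng.normal(size=5) for _ in range(3)]         # per-partition gradients g1, g2, g3

send = {
    0: 0.5 * g[0] + g[1],      # worker 0 holds partitions {1, 2}
    1: g[1] - g[2],            # worker 1 holds partitions {2, 3}
    2: 0.5 * g[0] + g[2],      # worker 2 holds partitions {1, 3}
}

# Decoding coefficients for each possible pair of non-straggling workers.
decode = {
    frozenset({0, 1}): {0: 2.0, 1: -1.0},
    frozenset({1, 2}): {1: 1.0, 2: 2.0},
    frozenset({0, 2}): {0: 1.0, 2: 1.0},
}

survivors = {0, 2}                                  # suppose worker 1 straggles
coeff = decode[frozenset(survivors)]
recovered = sum(coeff[w] * send[w] for w in survivors)
print(np.allclose(recovered, g[0] + g[1] + g[2]))   # True
```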
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)
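One common instance of the compression schemes studied in the entry above is top-k sparsification with error feedback. The sketch below is a generic, hypothetical illustration (the helper name and the zero-initialized residuals are assumptions), not that paper's exact protocol.

```python
# Hypothetical sketch of top-k gradient sparsification with error feedback:
# each worker sends only its k largest-magnitude gradient entries and carries
# the rest over to the next step, so the compression error is not lost.
import numpy as np

def sparsify_topk(grad, residual, k):
    acc = grad + residual                           # add leftover error from last round
    idx = np.argpartition(np.abs(acc), -k)[-k:]     # indices of the k largest magnitudes
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                          # what actually gets communicated
    return sparse, acc - sparse                     # (message, new residual)

rng = np.random.default_rng(0)
d, n_workers, k = 1000, 4, 50
residuals = [np.zeros(d) for _ in range(n_workers)]

grads = [rng.normal(size=d) for _ in range(n_workers)]
messages = []
for w in range(n_workers):
    msg, residuals[w] = sparsify_topk(grads[w], residuals[w], k)
    messages.append(msg)

avg_update = np.mean(messages, axis=0)              # server averages the sparse messages
print("density of averaged update:", np.count_nonzero(avg_update) / d)
```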
- DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging [4.652668321425679]
The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/backward propagation while waiting for gradient synchronization.
DaSGD parallelizes SGD and forward/backward propagation to hide 100% of the communication overhead.
arXiv Detail & Related papers (2020-05-31T05:43:50Z)
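The delayed-averaging idea behind the DaSGD entry above can be sketched as a hypothetical single-process simulation with no real communication layer: the model average computed at step t is only folded into the workers' models at step t+1, which is what lets the all-reduce overlap with forward/backward computation.

```python
# Hypothetical sketch of delayed model averaging on a toy least-squares
# problem: the average "launched" at step t lands at step t+1.
import numpy as np

rng = np.random.default_rng(0)
d, n_workers, steps, lr = 16, 4, 500, 0.05
A = rng.normal(size=(256, d))
x_true = rng.normal(size=d)
b = A @ x_true

models = [np.zeros(d) for _ in range(n_workers)]
delayed_avg = None                                  # average still "in flight"

for t in range(steps):
    new_avg = np.mean(models, axis=0)               # all-reduce launched this step ...
    for w in range(n_workers):
        i = rng.integers(len(A))
        grad = (A[i] @ models[w] - b[i]) * A[i]     # local stochastic gradient
        models[w] = models[w] - lr * grad
        if delayed_avg is not None:                 # ... while last step's average lands now
            models[w] = 0.5 * models[w] + 0.5 * delayed_avg
    delayed_avg = new_avg

print("error:", np.linalg.norm(np.mean(models, axis=0) - x_true))
```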
- Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent [28.006781039853575]
A key element behind the progress of machine learning in recent years has been the ability to train machine learning models in large-scale distributed-memory environments.
In this paper, we introduce a general consistency condition covering the distributed SGD variants used in practice to train large-scale machine learning models.
Our framework, called elastic consistency, enables us to derive convergence bounds for a variety of distributed SGD methods.
arXiv Detail & Related papers (2020-01-16T16:10:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.